MacFormat 1997 January

Text File | 1996-08-06 | 179.6 KB | 3,614 lines | [TEXT/KEEN]

********************* hAWK User’s Manual ********************* Copyright © 1991 the Free Software Foundation, Inc. You can redistribute or modify this file under the terms of the GNU General Public License as published by the Free Software Foundation (see the file “COPYING hAWK”). font: Geneva 10. Four spaces per tab. hAWK is NOT a stand–alone application: it must be called by some other application. Interaction between hAWK and the calling application will vary according to how well the calling application supports text documents. However, virtually any (C-based) application can add the ability to call hAWK. For details, see “Calling hAWK from your application” near the end of this manual. Applications which support calling hAWK (add yours to the list!): Minimal App (included, with source code) EnterAct, RFEdit, EnterAct Lite You can read this document with any programmer’s editor (you may not see the 4 pictures - they’re not that critical). You’ll need an editor to view the results of a program run if you use Minimal App to call hAWK, since Minimal App does not do anything with text files, and you’ll find that Minimal App, with its minimal level of support, has limited program input options. In fact, calling hAWK through Minimal App shows what hAWK would look like if it were repackaged as a stand–alone application. See “Calling hAWK through Minimal App” (in the “Advanced topics” chapter) for tips on using Minimal App with an editor to run hAWK programs. Major topics are marked with MPW-compatible marks, available in many editors by holding down the <Option> or <Command> key while clicking in the window’s title bar. You can jump to a section heading by selecting the heading in the table of contents and using the editor’s “Enter Selection”/“Find Again” commands. 
The “Active index” at the end of this manual is suitable for on-line use, consisting of line numbers rather than page numbers; to jump to the line for a reference in the index, select the corresponding line number and use the editor’s “Go to” command. If you change the content of this manual you will throw off the Active index, and will lose the marker locations also if the editor doesn’t manage MPW–compatible marks. However, feel free to add or delete markers, or change the font. Why bother to learn hAWK? • Many editing and formatting problems that crop up in the life of a C programmer can be solved with a simple hAWK program. Now you have a choice—grind out a series of mechanically–repeated key strokes, or dash off an elegant little program. And when it comes time to solve a problem, a typical hAWK program can be run with two mouse picks and a press of the <Return> key (or even a command line). • On the Mac alone, there are versions of AWK that run under the MPW shell, under A/UX, and now with hAWK there is a version that’s handy to use in conjunction with THINK C. Never mind all the DOS and Unix implementations—even on the Mac, hAWK is a widely–used language. You’re not learning a white elephant, here. • Need to prototype a “little” language? Try out an algorithm? Looking for an introduction to C that comes with air bags? This is it. For a sampling of what hAWK can do, see “About the supplied programs” below. 
Contents
-----------
Introduction
Installing hAWK
Where to go from here
About hAWK
From AWK to gAWK to hAWK
What’s missing
What’s new
The calling application
A typical hAWK run
Running hAWK programs
The setup dialog
Concurrent and immediate modes
Selecting your program
Selecting input for a program
Setting variables
Library files
Showing the results
Saving the setup for a program
Cancelling a run
Standard input and output
About the supplied programs
hAWK program structure
From start to finish
Grouping and breaking lines
The command line and ARGV[]
Variables and constants
Variable names and types
Constants
Record and field variables
Built–in variables
Local variables in functions
Setting variables on the command line
Conversion between numbers and strings
Arrays
Patterns
Patterns and actions
BEGIN and END
Expressions as patterns
String-matching patterns
Regular expressions
Compound patterns
Range patterns
Summary of patterns
Actions
Introduction
A preview of “print”
Expression operators
Built–in numeric functions
Built–in string and file functions
Control-flow statements
Empty statements
User-defined functions
Output
The “print” statement
The “printf” statement
Output into files
Closing files
Input
FS, the input field separator
RS, the input record separator
The “getline” function
The “hAWK” function
Advanced topics
Other ways of specifying input files
Beyond input records
Calling hAWK through Minimal App
Calling hAWK from your application
What and how
Getting started
Add two calls in your code
A minimal version
Callbacks, and showing results
Using a command line
Modifying hAWK
Introduction
hAWK THINK C project
Source
Libraries
Active index

-------------
Introduction
-------------
hAWK is AWK adapted for the Macintosh, a small programming language which is well-suited to jobs involving text manipulation and pattern recognition.
hAWK is not a stand-alone application, but is rather a CODE resource with a specific simple calling interface (called a "Drag_on Module"), and it is invoked by selecting "hAWK" from a menu in an application that can call Drag_on Modules. This manual will explain in more detail what hAWK is, and show you how to run hAWK programs.

There are many useful programs supplied in the "hAWK programs" folder, each with complete instructions at the top so you can try them out as you go along; they range from very simple to rather complex, general purpose to very special purpose, and illustrate the wide range of hAWK’s abilities, from counting lines in a file to cross–referencing your C source. The chapter below entitled “About the supplied programs” provides an overview of the programs in the “hAWK programs” folder. These programs are not just useful as “examples to learn from”: they are, for the most part, nontrivial, and supply real answers to the daily problems of a C programmer.

What is hAWK really? hAWK is what C could be if you weren't in a hurry. hAWK programs are relatively small, look rather like C code, and rely on powerful built-in capabilities and commands: capabilities like automatic reading of input files on a line-by-line basis, and commands such as "gsub", which is, just on its own, as powerful as Grep. The focus is on text, but the text can be just about anything; the sample program “$Print_MENU_Resource”, for example, will take the hex representation of a MENU resource as retrieved by Read Resource and format it to be human–readable.

The primary difference between hAWK and other versions of AWK lies in the method of running programs; hAWK’s setup dialog allows you to run programs with just a few mouse clicks, with typing needed only if you wish to assign initial values to variables before a run.
This is mainly because hAWK can take advantage of the window and file handling abilities of the application that is used to call it, to offer the options of taking input for the hAWK program from text in the front window of the calling application, or from the list of files selected for multi–file operations. These generalised input specifications, “whatever’s in the front window” and “whatever’s selected for multi–file operations”, eliminate the need to type in a list of file names for a program to use as input. And since each program can remember the general input method you have selected for it, repeated runs of a program are reduced to: bringing the input to hAWK’s attention, either by bringing a text file to the front or by selecting files for multi–file searching; and then running the program with three mouse clicks. This all makes hAWK as easy to run as a macro language, and since AWK is a widely–used, full–featured programming language you should find it well worth the effort of learning. Although running hAWK with the setup dialog is normally the easiest way, hAWK can also be called with an old-fashioned command line. You can’t pass any frontmost text to hAWK this way (since the frontmost text will be the command line), but you can typically specify all files selected for multi-file operations as input, or specify one or more input files using full path names. If you want to implement an application that supports calling hAWK via a command line, please see the section "Using a command line" in the "Calling hAWK from your application" chapter below. If you just want to use the command line approach, see the documentation supplied with your application that calls hAWK for the details on how to do it. --------------- Installing hAWK --------------- If you can read this, then you’ve installed hAWK, since it is being shipped in compressed form these days. 
As a reminder, hAWK should be inside your "Drag_on Modules" folder, and this folder should be in the same folder that contains the calling application, at the same level. The "hAWK programs" folder should also be in the "Drag_on Modules" folder, and this manual can go anywhere. To verify that hAWK has been installed, start up an application that can call hAWK and then check the menus; you should see “hAWK” as one of the items. Select “hAWK”, and the setup dialog for hAWK will appear. Venture on ahead fearlessly if you like, armed with the magic incantation that holding down the <Command> key while typing a <period> will interrupt any running hAWK program. ------------------ Where to go from here ------------------ Read straight ahead here until you’ve tried out a few hAWK programs and are comfortable with the overall approach to running them. The supplied programs in the “hAWK programs” folder are worth exploring to get a feel for what hAWK can do—and you’ll likely find that several of them provide answers to problems little or big that you regularly face. The remainder of this manual delves into the inner workings of hAWK, necessary reading if you want to write your own hAWK programs (and who could resist?). If you make use of the markers in this manual for the chapter and section headings, and the active index at the end listing topics, you’ll be able to browse around almost as easily as with a printed book. This is a good–sized manual, and if you try to read straight through it at one sitting you’ll probably hurt your head. Just amble along at a gentle pace, and when ideas or questions pop up, you’ll find it well worth the effort if you take a moment to write a one or two–line hAWK program to try the notion out. Running a hAWK program takes just a few mouse clicks. The easiest way is with “$RunClip” (see chapter H). You can, if you wish, print this manual yourself. Aha, but what about that index, which lists line numbers rather than page numbers? 
Thought you might ask that. What you want, then, is a version of this manual with line numbers added at the beginning of each line. An ideal job for hAWK!

1 Use a “Save As” command to save this manual under a different name, such as “hAWK Manual” (or save it under the same name but in a different folder).
2 Select “hAWK” from the calling application’s menu, and the setup dialog will appear; select “$AddLineNumbers” from the “Main program:” popup menu at the top; pick the “Select input file” option from the “Take input from:” popup, and use the standard Open dialog that appears to select the copy of this manual that you just created.
2a Click “Run” and wait a bit....and you’re back in the calling application.
3 Open the copy of this manual. If you left it on–screen while running hAWK, choose “Revert” to see the changed version (you can force Revert to be enabled by typing one character in the window).
4 Print the result, changing the font first if you like.
5 Note, to include the pictures you will have to use ResEdit to copy them from the original manual to your copy of the manual, and use EnterAct to print. They deal with the setup dialog only, and you shouldn’t miss them much if you don’t bother.

A very readable description of AWK (excluding the Macintosh variations of hAWK) can be found in
    "The AWK Programming Language", Alfred V. Aho, Brian W. Kernighan, Peter J. Weinberger, Addison-Wesley, 1988. ISBN 0-201-07981-X.
on the "Languages" or "unix" wall of your favourite bookstore. A more relaxed, though less ambitious, introduction can be found in
    "sed & awk", Dale Dougherty, O’Reilly & Associates, Inc., 1991. ISBN 0-937175-59-5.
The coverage of regular expressions is especially sympathetic.

----------
About hAWK
----------
From AWK to gAWK to hAWK

hAWK is a Macintosh version of AWK, a pattern-recognition and data-manipulation language that is popular on unix systems.
This version of hAWK is a modification of GAWK, the GNU Project's implementation of the AWK programming language, which differs in only minor ways from "classic" AWK. "hAWK" will be the name used below, except where differences from Gawk or AWK need pointing out. AWK has a venerable history, going all the way back to 1977 when messrs Aho, Weinberger, and Kernighan developed it at Bell Labs to fill in some small holes in Unix. The idea then was to write one or two–line programs to solve simple pattern–matching and text or number transforming problems—programs so small that you wouldn’t even bother to save them, just type them in on the fly, right on the command line. Over the years, users have pushed the limits of AWK, and many features have been added (user–definable functions being the nicest), and now multiple–page AWK programs are commonplace. GAWK is a Unix/IBM version of AWK, developed around 1986 by Paul Rubin and Jay Fenlason and copyright by the Free Software Foundation. It adds some useful enhancements to AWK, dealing mostly with files and variables. hAWK is essentially GAWK adjusted for the Macintosh, with the addition of a dialog interface to take advantage of windows and mice. If you wish to distribute hAWK, by the way, you should note that it is governed by the Free Software Foundation’s copyright restrictions (not too horrible) which you can find in the file “COPYING hAWK” in the source code folder for hAWK. What’s missing Pipes are missing. Pipes take a full–fledged shell to run, and most applications aren’t up to it. Since hAWK is packaged as a CODE resource to be called by any old application, pipes had to go. Similarly, the “system” command (which allows one to call other shell commands from within an AWK program) has been dropped. What’s new The interface is new. No more command line—most hAWK programs can be run with just a few mouse clicks, and typing is needed only if you want to set the value of variables before running the program. 
(Note a command line is supported though.)

There are seven new built–in string functions, “lookup”, “sort”, “time”, “prompt”, “progress”, “getclip”, and “putclip”, described in “Built–in string and file functions” in the “Actions” chapter. Some new file and directory functions are also described there.

The “lookup” function returns the type of a C term as an integer code (#define = 1, variable = 2, etc), useful when doing cross-referencing. It relies on the calling application for this diagnosis, so hAWK programs that use “lookup” should be called only through applications which support it (Minimal App doesn’t).

The “sort” function is provided to (mostly) make up for the lack of a shell sorting function. It’s fast, and can do ASCII, numeric, or dictionary–order sorting of an array, in forward or reverse order.

The “time” function produces the current date and time, to the second.

The “prompt” function prompts you with a dialog to enter some text, and returns what you enter as a string, as in
    X = prompt("Please enter a value for X:")

The “progress” function allows you to show (and update) a message while a program is running.

The “getclip” function returns a string holding the calling application’s current (up to the second) private clipboard. This can be used to pass instructions or data to a hAWK function while it is running concurrently with your application (more on this, needless to say, below). Similarly, “putclip” puts a new string of text on the clip.

As a partial replacement for the “system” command, any hAWK program can call any other hAWK program as a “subroutine”, via the “hAWK()” function. Using this function, a program can generate a special-purpose program and immediately execute it (eg $MFS_SuperReplace), or selectively execute a series of programs (eg $Chain). It also allows you to type in and run programs without saving them first (eg $RunClip). This function is described in its own chapter, “The hAWK function”.
Three built–in variables have been added: RUNERR, STDPATH, and TIME. See “Built-in variables” in the “Variables and constants” chapter for details.

hAWK uses the concept of standard input, output, and error, but strictly in the form of files with the fixed names $tempStdIn, $tempStdOut, and $tempStdErr. These files are created and written to as needed, and can be found in the same folder that contains your “Drag_on Modules” folder after you’ve begun running hAWK programs. These are temporary files, and will normally be overwritten by each hAWK program run.

The regular expressions implemented in hAWK are full regular expressions, with the ability to tag subexpressions, match word boundaries, ignore case, and deal with multi–line strings. Just about anywhere else in this world, you’ll find either full regular expressions or the ability to tag subexpressions, but not both. One minute you want the “or” operator, the next minute you want to tag something; it gets rather frustrating. There is absolutely no good reason not to allow both together, so in hAWK you’ve got them. Speaking of gripes, most Greps will limit you to a single line; that’s not just frustrating, it’s downright crippling. (By the way, another major improvement over Grep is that in AWK/hAWK your regular expression can be the string resulting from the evaluation of one or more variables, eg
    if (no_plus_or_minus)
        integer_pattern = digits;                  # digits == "[0123456789]+"
    else
        integer_pattern = plus_or_minus digits;    # plus_or_minus == "[+-]?"
and a pleasant side–effect is that regular expressions can be very readable if you want.) For the details, see “Regular expressions” in the “Patterns” chapter.

If the calling application supports the notion, your hAWK programs will by default run concurrently with your calling app. This means you start up the hAWK program, and then go back to working in your application (or background it and work somewhere else) until the hAWK program is done.
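Returning briefly to the regular-expression point made above: the idea of assembling a pattern from variables can be tried out with a short program. What follows is only a minimal sketch in plain AWK, assuming nothing beyond string concatenation and the "~" match operator with a string on its right-hand side. The function name is_integer and the sample values are made up for illustration and are not part of hAWK.

```awk
# Build a regular expression from string variables, then match with it.
# is_integer() is a hypothetical helper, not a hAWK built-in.
function is_integer(text, no_plus_or_minus,    digits, plus_or_minus, integer_pattern) {
    digits = "[0123456789]+"
    plus_or_minus = "[+-]?"
    if (no_plus_or_minus)
        integer_pattern = digits
    else
        integer_pattern = plus_or_minus digits
    # Anchor the pattern so that the whole string must match.
    return text ~ ("^" integer_pattern "$")
}

BEGIN {
    print is_integer("-42", 0)     # a sign is allowed here
    print is_integer("-42", 1)     # a sign is not allowed here
    print is_integer("1996", 1)
}
```

The extra names after the gap in the parameter list are the usual AWK way of getting local variables. Because the pattern is an ordinary string, each piece can carry a readable name.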
The “prompt” and “progress” functions are non-functional in this concurrent mode, so you can run programs in the “immediate” mode, which supports “prompt” and “progress”, by holding down the <Shift> key while selecting “hAWK” from the calling application’s menu. In immediate mode, you will be locked out of the calling application until the hAWK program ends. Programs will run more slowly in concurrent mode (the speed drop being slightly greater if you put the calling application in the background), but this is usually more than compensated for by being able to carry on with other things, rather than just sit there watching the watch cursor. The running hAWK program usually doesn’t affect application performance very much. For more about this, see “Concurrent and immediate modes” in the “Running hAWK programs” chapter.

The calling application

Any C-based application can call hAWK and other Drag_on Modules, as the source code for Minimal App demonstrates. The level of interaction between hAWK and the calling application is up to the author of the calling application, and can vary more or less according to the following table:

Level         Support for interactive features
----          -----------------------
minimal       (none; no result showing, input options limited to one specific file)
basic text    pass front text window as input option, show stdout after a run
full text     basic text, and pass list of selected files as input option
full          full text, diagnose the type of a C code term, pass the clipboard

If the application you are using provides only minimal support, then some extra manual steps are needed to persuade a hAWK program to take input from the current front text file or a list of files, and to view the results of a run; see “Calling hAWK through Minimal App” in the “Advanced topics” chapter for some tips on this.
The discussion there is “advanced” only if you want to understand all the details; you can use the methods described there by rote (for example, if it says paste this bit of code into the top of a program and you’ll have support for taking input from a list of files, you can do it now and worry about how it works later).

------------------
A typical hAWK run
------------------
Have you installed hAWK yet? If not, now would be a good time (see above). We’ll assume that you’re calling hAWK through an application that supports passing all or part of the front text window as input options, and showing stdout after a run, to make life simpler. If you don’t have such an application, you can use Minimal App in conjunction with whatever editor you are using to view this file, as described in the “Advanced topics” section “Calling hAWK through Minimal App”.

One of the programs supplied with hAWK is “$EnumSwitch”, which takes a list of enum constants and generates a “switch” statement based on them. It’s contained in the folder “hAWK programs”, which is inside the “Drag_on Modules” folder; you might like to take a look at it first....

OK, here we go: first, move this window on your screen so that you can see the next few lines while the hAWK setup dialog is in front (select hAWK now from the appropriate menu and Cancel to see where it appears). Now select the following line of text:
    {first, second, third, fourth, twilightZone = -99}
Is it highlighted? Good. Now, select hAWK from the menu; when the dialog appears, select “$EnumSwitch” from the top popup menu called “Main program:”, and finally, click on the Run button or hit <Return> on the keyboard.

You should be back in the calling application now, with a switch statement coming up in a window called “$tempStdOut”. hAWK took the line that you highlighted above, stripped it down, built a switch statement out of the words, and wrote the results to the disk file “$tempStdOut”.
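To give a feel for how little code a job like this needs, here is a rough sketch of an enum-to-switch generator in plain AWK. It is not the source of the supplied “$EnumSwitch” program, and the layout that program actually produces may well differ. The function name enum_to_switch, the placeholder variable x, and the output format are all assumptions made for illustration.

```awk
# Turn "{first, second = 5, third}" into the skeleton of a C switch statement.
# enum_to_switch() is a hypothetical helper written for this sketch only.
function enum_to_switch(list,    names, count, i, result) {
    gsub(/[{}]/, "", list)              # drop the braces
    gsub(/[ \t]*=[^,]*/, "", list)      # drop initialisers such as "= -99"
    gsub(/[ \t]+/, "", list)            # drop any remaining blanks
    count = split(list, names, ",")
    result = "switch (x)\n\t{\n"        # "x" is a placeholder to edit by hand
    for (i = 1; i <= count; i++)
        if (names[i] != "")
            result = result "\tcase " names[i] ":\n\t\tbreak;\n"
    return result "\t}"
}

# Treat every input line as one list of enum constants.
{ print enum_to_switch($0) }
```

With the single highlighted line as input, the sketch prints one "case" entry for each of the five constants. A list spread over several lines would need the lines joined first, which this sketch does not attempt.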
The calling application is now showing you the resulting file, with contents selected and ready for pasting into your source code. Most hAWK programs can be run this easily. Now, the full story.

------------------
Running hAWK programs
------------------
The setup dialog

When you select hAWK, the above “setup” dialog always appears first. A typical program run consists of: setting up the input to be ready for hAWK; selecting hAWK to see the setup dialog; selecting the program to run from the “Main program” popup menu; and hitting the Run button. If you have variables in the program that need to be set just before running the program, then you can set up to 10 variables by using the dialog that appears when you click the “Set variables” button. The input option, variable settings, and names of any associated libraries can all be saved with a hAWK program via the “Save settings” button, so that when you run a program again you’ll need to adjust the setup only for things that have changed (typically only the values to be initially assigned to variables, if anything).

Concurrent and immediate modes

With most little languages, when you run a program that’s all you do: run the program. No continuing to work in your primary application, let alone switching to another application. In the rare case when you want hAWK to completely take over your Macintosh, locking you out of the calling application, hold down the <Shift> or <Option> key while selecting “hAWK” from the calling application’s menus. If the program uses the “prompt” or “progress” functions, it will be necessary to run in this “immediate” mode, since they just return null results in the “concurrent” mode. In all other cases, just select “hAWK” from the calling application’s menus without holding down the <Shift> or <Option> key, and if the calling application supports it, you’ll be returned almost immediately to your application, able to carry on working there while the hAWK program runs at the same time.
This “concurrent” mode of running programs does not greatly slow down the calling application or any other application that you switch to. The hAWK program itself will run more slowly than in immediate mode, often taking about 50% longer; but if you don’t need the results in a huge rush, stick to the concurrent mode and just forget about the hAWK program until it winds up with a beep.

While a hAWK program is running concurrently, you won’t be able to run any additional Drag_on Modules. This is because they all use the same standard output file ($tempStdOut), and a fight could develop over who gets to write to it.

While a hAWK program is running concurrently, you will not be able to save to any files that hAWK is using. Regular input files are accessed only one at a time, and the standard input/output/error files will normally be “busy” from beginning to end of the run. In addition, any files being read from or written to via redirection (see “Output” and “Input” chapters) will not be writeable. However, you will be able to open any file that hAWK is using to take a look at it. With a lengthy program, you can check in with hAWK now and then by opening (or reverting) $tempStdOut to get a snapshot of how things are progressing.

See the supplied program “$LogDaemon” for an example of a hAWK program which idles unobtrusively underneath your calling application, waiting to take special action when you copy a specific instruction to the application’s clipboard. A “daemon”, by the way, is an invisible, powerful spirit with your best interests at heart. It “possesses” your Macintosh, in a nice way. And the name is a bit more entertaining than the plain old “forks” and “threads” etc.

Concurrent execution is currently supported by: EnterAct.

Selecting your program

The “Main program:” popup at the top of the setup dialog lists all text files in the “hAWK programs” folder whose names begin with a dollar sign ($). This list is rebuilt each time you call hAWK.
If a program is not listed in the popup, you can still run it by picking “Select unlisted program”, the first item in the “Main program” popup, and then using the standard Open dialog that appears to select the program—note it could be in another folder, or in the “hAWK programs” folder but not shown in the popup simply because its name doesn’t start with a “$”. You can avoid clutter in this popup by starting the names of only your most popular hAWK programs with a “$”, so that other less–frequently used programs won’t be shown in the popup—if they are in the hAWK programs folder, they will still be close at hand. Selecting input for a program This is one of hAWK’s nicest features, allowing hAWK to interact with the calling application to provide quick input file specification. Two additional ways of specifying input files, not listed in the “Take input from” popup, are described in the “Advanced topics” chapter, in “Other ways of specifying input files”. Under the “Take input from” popup menu, the options “Front text selection” and “All of front text” refer to the text window that happens to be in front just before you call hAWK from the calling app’s menu. According to what you select here, all or just the selected part of the text in the front window will be written to a temporary file called “$tempStdIn”, and passed to your program as the input file to use. If your program is to be run using one of these options, bring the text window containing the text to be used as input to the front just before calling hAWK, and if you’ll be using the “Front text selection” option, you should select the text as well. For an example, see “A typical hAWK run” above, where this manual itself served as the front text. 
The “MFS selected files” option in the “Take input from” popup refers to a list of files selected in the calling application for multi–file operations (typically this list is used mainly for multi–file searching in the calling application, and you construct it by placing check marks or bullets • beside file names—see the calling app’s manual for details). With this option selected, all files selected for multi–file operations will be passed to the hAWK program as input. This means you can set up a list of files in the calling app, and then have your hAWK program take its input from those files, from one file to hundreds. One limitation of this approach is that you can’t specify the exact sequence in which the files will be dealt with. With many programs, this is not a problem (multi–file search and replace, for example). To treat input files in a specific order see “Other ways of specifying input files” in the “Advanced topics” chapter. The “Select input file…” option allows you to use a standard Open dialog to pick one specific file to use as input for a hAWK program. As with all other aspects of the setup dialog, if you click “Save settings” the name of the file you select will be saved with the program itself, and restored for the next run. Aside from “Select input file…”, input options will not be shown if they are not currently available. In rare cases, you may need no input at all for your program. To ensure that no input is passed, pick the “Select input file…” input option, cancel the Open dialog that appears, and then click the “Save settings” button. The input option for your program will thereafter read “Select input file…”, as though imploring you to pick one, but no input will be sent to your program. It’s harmless if input is sent to a program that doesn’t want any, the only penalty being time lost if a massive amount of input is accidentally ordered along for the ride. 
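As a small illustration of a program that wants no input at all: in AWK generally, a program made up solely of a BEGIN action does all of its work before any input is read. The following is a minimal sketch in plain AWK; the function name squares_table and the table it prints are arbitrary choices for the example.

```awk
# A program with only a BEGIN action: it produces output without
# consuming any input records.
# squares_table() is a hypothetical helper written for this sketch.
function squares_table(limit,    i, text) {
    text = ""
    for (i = 1; i <= limit; i++)
        text = text sprintf("%2d squared is %3d\n", i, i * i)
    return text
}

BEGIN {
    printf "%s", squares_table(5)
}
```

A program of this shape is a natural candidate for the “no input” setup described above, since anything passed to it would simply go unread.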
Setting variables

The “Set variables” button allows you to preset the values of variables just before running a program, without having to edit the program itself. As you can see from the picture, it’s a simple matter of typing the variable name, followed by an “equals” sign, followed by the value of the variable, either a number or a string. Quotes should not be used to surround strings; just enter the string itself. Any spaces between the “=” and the value will count as part of the value, so normally you should enter the value with no spaces between the equals sign and the value. For example,
    find =spot
and
    find = spot
produce different results. Spaces are optional between the name of the variable and the equals sign. The limit on the length of the variable assignment, including the name of the variable, is 100 characters. Up to 10 variables may be given values this way.

Special characters such as tabs and returns can be placed in a string by using the standard escape sequences familiar from C, eg
    find =\tspot\n
assigns to “find” the string consisting of a tab, followed by s-p-o-t, followed by a carriage return. You can also assign the value of a (dynamic) regular expression using the “Set variables” dialog, for example
    find =\.#[A-Za-z_]+
(never mind what it means for now). Note there is no need to enclose it in forward slashes, and many characters must be escaped with a backslash if you want them matched literally (the section “Regular expressions” in the “Patterns” chapter explains the nuances).

Clicking the “Save settings” button will save your variable assignments for subsequent runs. Hence you’ll need to use the “Set variables” dialog only when the preset value of some variable changes. If variable presets exist for a program then the “Set Variables” button will acquire a gray outline as a reminder that some variables may need changing before running the program.
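A program that relies on a preset variable simply uses it without ever assigning it. The sketch below, in plain AWK, reports what value “find” arrived with and then prints the input lines containing that text. The helper name describe is made up for this example, and the program assumes the variable is called “find” as in the examples above.

```awk
# describe() is a hypothetical helper; "find" is assumed to be preset
# before the run and is never assigned by the program itself.
function describe(value) {
    # The brackets make leading blanks and tabs in the value visible.
    return "[" value "] is " length(value) " characters long"
}

BEGIN {
    print "find = " describe(find)
}

# Print every input line that contains the preset text literally.
find != "" && index($0, find) > 0 {
    print NR ": " $0
}
```

Run once with “find =spot” and once with “find = spot”, the first line of output should show the extra leading blank in the second case, which is why the two presets match different lines.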
With some programs (such as $CompareFiles) you’ll almost never change the preset variables, but with others (such as $MFS_SuperLister) you’ll want to change one or more variables before almost every run.

Library files

Technically, this is an advanced topic, but it’s simple to use. If you develop some general–purpose functions, such as sorting routines, that you wish to use in several programs without duplicating the function definitions within each program, you can save the functions in a separate file and add that file to each main program as a library. The contents of the library file are simply appended to the contents of the main program before running it, so the library can in fact contain any valid hAWK statements. However, to preserve sanity, libraries should be restricted to just functions.

To add a library file to a main program:
1 Use the “Main program” popup to select the program
2 Use the “Select library…” item in the “Libraries:” popup to add the library by using the standard Open dialog that appears.
3 Clicking the “Save settings” button will preserve your selection of libraries for subsequent runs.

To delete a library, select it using the “Libraries:” popup.

One sample library is included, in the file “SortLibrary”. It is not used in the sample programs; it’s just an example (PLEASE NOTE hAWK has its own built–in sort function, which is very fast). Little is lost if you follow the policy of not using libraries: programs are easier to read if all the code is in one place.

Showing the results

Output from hAWK programs is produced by “print” or “printf” statements, which send their output to the file “$tempStdOut” unless you explicitly redirect it. For example,
    print "some text"
will print the string "some text" to $tempStdOut. The file $tempStdOut is created and managed for you, and most hAWK programs will send at least some output to this file.
If you would like to see this file after the program is finished, put a check mark in the “Show stdout” checkbox in the setup dialog just before running the program. When the program is done, the calling application will then show you the $tempStdOut file in a window, if it is able to. If the calling application doesn’t support showing stdout, you’ll have to manually Open or Revert the $tempStdOut file using your editor (for more on this see “Calling hAWK through Minimal App” in the “Advanced topics” chapter).

Place a check mark in the “Select all of stdout” check box to have all of the output in the $tempStdOut window selected at the end of the program run. This is handy if you’ll be wanting to copy the entire output and paste it in elsewhere.

Saving the setup for a program

The “Save settings” button saves away your selection of options for a program, so that they will be restored for subsequent runs of the program. These options are saved with the program itself, in a special resource. The saved options are:
1 The names of any libraries associated with the program
2 Names and values of any preset variables
3 Your choice of input option, including the input file name if you have used the “Select input file…” option to pick a specific file.
4 Your output options, in the checkboxes “Show stdout” and “Select all of stdout”.

During the first run of a program that you have written, you should set up the options you want and then click the “Save settings” button. Subsequent runs will then consist of just these steps:
1 Select “hAWK” from the calling application’s menu
2 Use the “Main program” popup to select the program
3 Use the “Set variables” button if needed to put in new values for variables (many hAWK programs don’t need this)
4 Click the Run button.

Occasionally, you may want to run a program using a different input option, for example run it using “MFS selected files” rather than “All of front text”.
This is simply a matter of selecting the new input option from the “Take input from” popup just before running the program. If you want the input option to be permanently changed for the program, click the “Save settings” button after picking the new input option.

Cancelling a run

To cancel a hAWK program, hold down the <Command> key while typing a <period>. Program execution should cease within one second.

------------------
Standard input and output
------------------

Drag_on Modules such as hAWK and Read Resource use three disk files to communicate with you and with the calling application. These text files carry the burden of standard input/output for Drag_on Modules. If a Drag_on Module requires a large chunk of input that is not already in an appropriate disk file, the input will be written to the standard input file “$tempStdIn”, and all normal output from Drag_on Modules is, unless you specify otherwise, sent to the file “$tempStdOut”. If errors pop up while the Drag_on Module is running, error messages will be written to the file “$tempStdErr”. These files are all created and written to automatically as needed, and can be found in the same folder that contains your “Drag_on Modules” folder.

The file of main interest here is $tempStdOut, which typically holds the results of a Drag_on Module run. Drag_on Modules don’t show you this file, but can request that the calling application show it to you. This is always the case with Read Resource, and is optional with hAWK; it depends on whether you put a check in the “Show stdout” checkbox in the setup dialog. All of the supplied hAWK programs that write output to $tempStdOut have saved settings that include putting a check in this box.

Because the results of Drag_on Module runs are by default written to a fixed text file, you can easily pass the output from one run to the input of another run.
For example, Read Resource creates a formatted text version of a resource and writes the results to $tempStdOut, which is then shown to you by the calling application. You can then call a hAWK program to further process this output, by leaving the $tempStdOut window in front and having the hAWK program take its input from the front window (pick the “All of front text” option from the “Take input from:” popup menu). And you can pass the output from one hAWK program to the input of another in the same way.

A Drag_on Module can only request that the calling application show you the $tempStdOut file, but whether or not it does so is up to the author of the calling application. If it doesn’t, you’ll have to Open or Revert $tempStdOut yourself in order to see the results.

The contents of $tempStdOut are indeed temporary, and will be overwritten by the next hAWK program, or indeed any other Drag_on Module, that you run. If you want a permanent copy of the output from a program, use “Save As” to save $tempStdOut under a new name, or copy the contents to a working window.

hAWK always takes input from a file, and if you are using one of the “front text” options for input then hAWK will write a copy of the front text to $tempStdIn before running your program. Output from hAWK programs, which is generated by “print” and “printf” statements, can be explicitly redirected to any file, but if no redirection is provided then by default the output from the program is sent to $tempStdOut. The file “$tempStdErr” will hold error messages if problems pop up while running a program.

Sometimes you’ll want to take input directly from the file $tempStdOut, without bothering to use the above method of opening the file and bringing its window to the front. It is perfectly OK to select $tempStdOut as the input file using the “Select input file...” option under the “Take input from:” popup.
The contents of $tempStdOut just BEFORE the run will be used as the input, and input from this “old” version of $tempStdOut will not be affected by anything you write to $tempStdOut during the execution of your program. Actually, your old $tempStdOut will be renamed to $tempOutAsInput just before the run, and the file name your program receives will also be changed. This bit of subterfuge is necessary since it is not possible to randomly read and write the same file without things getting horribly confused.

------------------------
About the supplied programs
------------------------

For the most part, the programs you’ll find in the “hAWK programs” folder do useful things (from the point of view of a C programmer), with just a few of them being of the traditional “completely useless but illustrating some basic point” kind that are often foisted on innocent customers by authors who have run out of steam before writing the manual.

There are nearly as many categories of supplied programs as there are supplied programs, so the following list with brief descriptions is in simple alphabetical order. The descriptions are brief here because each supplied program contains a detailed explanation of what it does and how to use it, at the top.

“$RunClip” provides a handy way to run small programs as you explore hAWK, without having to save them to disk first. You’ll find instructions below, and at the top of the $RunClip file.

Unless otherwise mentioned, a program sends its output to the file $tempStdOut, and you will be shown the contents of this file by the calling application at the end of the run (if it is able to do so).

Most programs will accept input from any source, but then again most programs are especially useful with just one or two input sources. $EnumSwitch, for example, expects a comma–separated list of enum constants as input, normally provided by selecting the enum constants in a source code window and taking input for $EnumSwitch from the selected front text.
Running this program on a batch of MFS selected files is possible, but wouldn’t produce very useful results. Once you understand roughly what a program does, you should be able to judge what sorts of input are appropriate for it.

The detailed instructions for running a program can be found at the top of the listing for the program itself, and you should read through those before running a program for the first time. For example, with $MFSLister you have to tell it what string to search for, and this is done by setting a variable with the “Set variables” button.

Programs which make essential use of the “progress” or “prompt” functions should be run in “immediate” mode (see “Running hAWK programs”, section “Concurrent and immediate modes”). To run a program in immediate mode, hold down the <Shift> or <Option> key while selecting “hAWK” from your application’s menus. Programs that should be run in immediate mode are marked with (IMM) just after the program name below.

$AddLineNumbers: will add line numbers to a file. Takes input from one specific file, and overwrites the contents of the file. Doesn’t number blank lines.

$Chain (IMM): allows you to run one or more small canned programs on your input, the first program being executed using whatever input you specify, and the following programs if any taking their input from stdout. You type in the names of the programs to run in a dialog box, and they are executed from left to right in the order you typed them. Effectively serves as a “library” of small tasks. Illustrates using the hAWK() function to execute a sequence of programs, repeatedly taking input from stdout, and the “prompt” dialog box.

$Comments: extracts lines that contain C comments. Or rather, at least all lines that contain comments.

$CompareFiles: prints differences between two versions of a file; for use with the “MFS selected files” option.
Has a couple of options, but should almost always work fine with the defaults; see instructions if results seem suspicious. Lengthy miscompares (over 100 lines) will cause it to bog down. Demonstrates doing everything with functions rather than pattern–action blocks.

$DefineSwitch: generates a “switch” statement, with cases created from a list of #defined constants. Normally takes input from the selection in your front text window, output is shown selected in $tempStdOut for copying to your working window.

$EchoFileNames: for use with the “MFS selected files” option, creates a list of the file names that were selected.

$EchoFullPathNames: like $EchoFileNames, but generates full path names in the general form “Disk:folder:folder1:...:folderN:filename”. Full path names are required when redirecting input and output of hAWK programs.

$EnumSwitch: like $DefineSwitch, but generates the cases for the switch from a comma–separated list of words, typically enum constants. Initializations for any of the constants are ignored.

$ExtractExternRefs: list all C declarations encountered that begin with “extern”. Fast and simple, but will stumble if it encounters “extern” as the first word in a comment. (Exercise: steal the comment–skipping code from $XRef to fix this little problem).

$FilesInOrderTest: discussed in the “Advanced topics” chapter way down below. Demonstrates the technique of taking input from an arbitrary list of files, the list itself being the sole input you pass to the program.

$FindSetVolEtc: an example of a small program knocked off in a minute to solve a specific search problem. Searches for a list of specific terms, prints the file name and line number where found, together with the context of the find.

$FrequencyWord: lists unique words in one or more documents, in declining order of frequency. Demonstrates associative arrays and the sort command. A companion to $WordFrequency.
$List_Potential_C_Locals: feed this the body of a C function, and it will return a list of candidates for declaration as local variables within the function. Contains a near-complete lexical analyser for C, and produces best results if the calling application supports the “lookup” function.

$Lockout (IMM): a pathological excess. MUST be stopped with <Command><period>. Displays a marquee–style message in Chicago or “giant” while you go to lunch. Trivial, but the code itself is worth looking at (it can archive giant messages to files, demonstrates two–dimensional arrays, implements severe abuse of the progress() function). You can set the message before running, by changing the “message” variable. Some other options available.

$LogDaemon: the only supplied program that must be run in concurrent mode. It waits around until you copy the (almost) word "logit", flashes the menu bar to acknowledge, and then will append the NEXT bit of text you copy to a specific file, together with a date stamp. Then another flash to signal that it’s done. This program runs until you type <Command><period>. See instructions before using, since you’ll need to change the name of the log file.

$LongestLines: will print out a list of the longest lines in one or more files. Use “Set variables” to set how many lines to print, and how many spaces in a tab before running. Properly converts tabs to spaces for calculating lengths, illustrates several basic string functions.

$LookupTest: a demonstration of the lookup() built–in function.

$MFSLister: searches for a string or a regular expression (restricted to checking one line at a time). Prints file name and line number where found, with optional printing of the line containing the match.

$MFS_SuperLister: searches for a regular expression or plain text involving variable white space, can match it even if it spans a variable number of lines (try that with Grep!). Lists file name and line where found.
It’s up to you to provide the text or regular expression. The innards are much like $MFS_SuperReplace.

$MFS_SuperReplace: multi-file search and replace, searching for a regular expression or a string of literal text that can span a variable number of lines. Replacement text can replace or extend the pattern found. Alters the original files, fully documents changes to stdout. Demonstrates using the hAWK() function to selectively alter and execute a program, handling a variable number of input lines at once in a “rolling buffer”.

$Print_MENU_Resource: given the result of Read Resource on a MENU resource, this program prints a nicely–formatted version of the menu. A sample for doing your own custom resource or data formatting and content verification, including all of the necessary basic functions for doing so.

$Print_MPSR_1007: given the result of Read Resource on a “MPSR 1007” resource (ie marks for a text file), prints out a nice version (see also $Print_MENU_Resource).

$printNF: trivial, prints the number of fields in each input line.

$ProgressTest, $PromptTest (IMM): demonstrate the prompt() and progress() functions. (The ultimate progress() example is $Lockout; for a nice little prompt() example, see $YoungMath).

$RoughIndexer: if you dream of automatically generating an index, you can start here.

$RunClip: for short, disposable programs to be run concurrently (note that $Type&Run only runs in immediate mode). The calling application must support passing its clipboard to hAWK (eg EnterAct). Create your program in the calling app, Copy it, bring input to hAWK's attention (eg front text or a multi-file selection), then call up hAWK and select and run $RunClip. Your copied program will be saved to the file “$hAWKTempProgram”, and then executed using the built-in hAWK() function.

$SortTest: a test of the built–in sort() function, doing dictionary order. For a real use, see $WordFrequency.

$SortTest_Nums: a sort() test on numbers.
Uses rand() to generate the numbers.

$StubFunctions: given a list of C function prototypes, generates empty function shells for the function definitions.

$TabsToSpaces: converts tabs to spaces in one or more documents, replacing each tab by the appropriate number of spaces (anywhere from 1 to “spaces_in_tabs”), consistent with the tab interpretation of THINK C et al. You set the number of spaces in a tab with “Set variables”, and also whether to overwrite the original file or make a copy with a new name. Demonstrates some basic file–handling methods.

$Time: just prints out the time, using the TIME built–in variable, and the time() function for comparison.

$TwoColumnsRight: given a list of numbers in two columns, right–justifies the numbers in the columns. Demonstrates dynamically building a printf() format string with variables and string concatenation.

$Type&Run (IMM): for short, disposable programs, use the dialog box presented by this program to type in and run your one or two-liner. Since <Return> means “OK” in the dialog, use <Command><Return> to advance to a new line. Illustrates using the hAWK() function to save and execute a program.

$Uppercase: changes the first letter in each input field to upper case if it is a lower case letter. Uses match(), sub(), substr().

$Whazzat: translates C declarations into English. Works best if the calling application supports the “lookup” function so that special terms in your declaration (typedefs, struct tags etc) can be diagnosed. Illustrates using functions instead of pattern–action blocks, retrieving tokens with string functions while parsing, reformatting long lines for output.

$WordFrequency: a “classic” use for AWK - print sorted list of unique words in the input, together with the number of times each word is used.

$XRef: generates file and line number listing for your choice of terms in C source code. Illustrates the hAWK() function, sorting.
The calling application must support the “lookup” function (see “Built–in string and file functions” in the “Actions” chapter).

$XRef_Full: like $XRef, but doesn’t skip comments and strings.

$YoungMath (IMM): demonstrates the prompt() function while urging you to add numbers.

---------------------
hAWK program structure
---------------------

From start to finish

A typical hAWK program run progresses as follows:

1 From the hAWK setup dialog, specify the main program to be run, add any library files that go with it (optional), specify initial values for variables (optional), and build a list of input text files for the program to work on (optional, but almost always included).

2 Collect the main program and libraries together into one big program. Reduce it to a form more suitable for interpretation. Assign initial values to variables if you have provided any. The list of input files is made available to the program, in the array ARGV[] of file names.

3 Execute the program: by default, hAWK automatically reads the text from the input files into memory, one “record” at a time (the default is that a line is a record). If a record matches one of your specified patterns, then action is taken. Statements may optionally be executed before and after the input is dealt with. Schematically, a generic hAWK program looks like

    #An abstract hAWK program:
    BEGIN       {beginning statements}
    pattern1    {action statements for pattern1}
    ...
    patternN    {action statements for patternN}
    END         {ending statements}
    (--supporting function definitions--)

and the corresponding program execution proceeds as follows:
• execute any supplied BEGIN statements
• read the input files into memory, one record at a time; for each record check all patterns; if the pattern is TRUE for the current input record, execute the associated action statements; in C this would look like:

    while (get_another_input_record())
        {
        for (pattern1 to patternN)
            {
            if (pattern is TRUE)
                {
                action statements for the pattern
                }
            }
        }

• execute any END statements

4 Unless otherwise specified by redirection, all output via “print” or “printf” statements goes to the default standard output file, called “$tempStdOut”.

5 Comments in the source code, which begin with a “#” and continue to the end of the line, are ignored.

BEGIN, END, and pattern–action blocks may occur in any order in the source for the program. Programs may also contain function definitions, which are introduced by the “function” keyword, and take the general form:

    "function" funcName(parameter1, parameter2,...local variables)
        {
        statements making up the function body
        }

If a function is generally useful, it may be placed in a library file to save duplication. You’ll find little emphasis on libraries, since it costs very little to duplicate a function right in the main program, and this makes the programs easier to read. Library files should be reserved solely for function definitions to avoid confusion.

hAWK automatically reads in your input files one “record” at a time, also breaking each record into “fields”. The current record is in the built-in variable $0, and the fields are in $1, $2, …$NF (where NF is another built-in variable giving the number of fields in the current record).
By default a record is the same as a line and fields are separated by blanks or tabs, so you can think of the default as reading your input one line at a time into $0 and making the individual words available in $1, $2 etc (but note that all punctuation except blanks, tabs, and returns will still be present in the fields). For example, if the current line in an input file reads
    "for (i = 0; i<7; ++i)"
then that will be the content of $0, and the fields will be
    $1 = "for", $2 = "(i", $3 = "=", $4 = "0;", $5 = "i<7;", $6 = "++i",
with NF, the current number of fields, set to 6.

Here’s a real program to give you a taste ("$EnumSwitch", in the “hAWK programs” folder):

    #$EnumSwitch
    #Select a bunch of enums, and run Hawk on the front selection
    # -optionally select the entire enum body from '{' to '}' with Balance
    #Leave "Show std out" and "Select all of stdout" checked
    {
    gsub(/=[^,]*/, " ")#remove initializations for the enum constants
    gsub(/=(.)*$/, " ")#ditto
    gsub(/[,{};]/, " ")#remove remaining punctuation, leaving just the enums
    for ( k = 1; k <= NF; k++)#build an array containing the enum names
        case[++i] = $k
    }
    END {
        print "switch (??)"
        print "\t{"
        for (k = 1; k <= i; ++k)
            {
            print "case " case[k] ":"
            print "\t"
            print "break;"
            }
        print "default:"
        print "\t"
        print "break;"
        print "\t}"
        }#end program

Given a list of names from an enum definition, such as
    "{left, right, up, down, twilightZone = 999}"
this program generates

    switch (??)
        {
    case left:

    break;
    ...etc…
    case twilightZone:

    break;
    default:

    break;
        }

To run this program: first select a list of comma-separated names (typically use the contents of an enum definition); select "hAWK" from the calling application’s menu; select "$EnumSwitch" from the "Main program" popup; (note the "Take input from:" popup will then read "Front text selection"); and click the Run button. The generated "switch" statement will appear in a window called "$tempStdOut", ready to be copied and pasted into your working window.
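To see the BEGIN / pattern–action / END sequence and the splitting of records into fields at work, a short sketch along the following lines may help. It is not one of the supplied programs. The counter name “shown” is made up for the example, and the default record and field separators are assumed.

```awk
# Sketch: show how each input record is split into fields.
# "shown" is a hypothetical counter name used only in this example.
BEGIN { print "field dump starting" }     # runs once, before any input is read

NF > 0 {                                  # pattern: skip records with no fields
    for (k = 1; k <= NF; ++k)
        print "$" k " = \"" $k "\""       # one field per output line
    print "NF = " NF
    ++shown
}

END { print shown + 0, "records shown" }  # runs once, after all input is read
```

Given the line for (i = 0; i<7; ++i) as input, this should list the same six fields described above, followed by NF = 6.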
Grouping and breaking lines

The rules for organizing and grouping your program lines differ a bit from the rules for C; a <Return> (also called newline) can stand for a semicolon after most hAWK statements, the price of this being that lines cannot be arbitrarily broken as in C, to avoid confusion between ending a statement and merely continuing it to the next line. The rules below are listed in rough order of their impact on whatever C formatting habits you have.

• When in doubt, use a backslash '\' immediately followed by a <Return> to continue a long line, as with preprocessor macros and strings in C. For example:
    x = y + (z - 1) + SomeFunction(param1, param2\
        , param3, param4) + w;

• Long conditional tests can be broken to the next line immediately after any logical operator (&&, ||, !). Eg:
    if ( lineNumber >= maxLines &&
        $0 != "")

• A long line may be broken after a comma, eg
    x = y + (z - 1) + SomeFunction(param1, param2,
        param3, param4) + w;

• The '{' that begins an action should be placed on the same line as the end of the pattern for it, eg
    FNR == 1 || FNR == 2 || FNR == 3 { #Note '{' is on same line as end of pattern
        print
        }

• A comment in hAWK begins with a '#' and continues to the end of the line. A comment can be placed at the end of any line except a line that is continued with a backslash and <Return>.

• Group multiple statements together with '{' and '}', as in C, eg
    if ($0 ~ /TEST/)
        {
        print "TEST on line", FNR
        ++numTests
        }

• When in doubt, terminate a single statement with a semicolon. Multiple statements may be placed on one line if separated by semicolons, eg
    if (a >= b) print "a is bigger"; else print "b is bigger";
or
    do ++x; while (x < maxForX);

• In if-else and do-while constructs, the “else” and “while” keywords should either be placed on a new line or preceded by a semicolon or '}'.
In other words, clearly signal the end of the “if” or “do” part, so that the “else” or “while” doesn’t pop up by surprise. These are OK:
    if (a > b) ++b; else ++a

    if (a > b) ++b
    else ++a

    do {--x; print x} while (x > 0)
these are not:
    if (a > b) ++b else ++a

    do ++x while (x < maxForX);

----------------------
The Command line and ARGV[]
----------------------

To run a hAWK program, you must tell hAWK which program to run, and what files to use for input data, with other optional details. Classically, these file names etc are passed to AWK in an array of pointers called argv; hAWK works the same way, but these names are generated for you when you set up a hAWK run using the setup dialog, saving you the work of typing them all in each time.

All you really need to know about the command line is that, at the time a program is run, the names of the input files it is being asked to deal with are contained in the array named ARGV, and the number of input files equals ARGC-1 (where ARGV is a built-in array name, and ARGC is a built–in variable name). Input file names are full path names, so typical contents are
    ARGV[1] = "Disk:folder:...:folder:First_Input_file"
    ...
    ARGV[ARGC-1] = "Disk:folder:...:folder:Last_Input_file".

Running the sample program “$EchoFullPathNames” on some input files will provide you with a specific example, so why not give it a try? Use your calling application to select some files for multi–file operations (“searching”), then run $EchoFullPathNames and see what results. This is the complete program:

    BEGIN {
        for (i = 1; i < ARGC; ++i)#note ARGV[0] is just "hAWK"
            print ARGV[i]
        }

Details follow on the command line generated by hAWK’s setup dialog, in case you are interested in modifying hAWK. You may also find this background helpful if you use the hAWK() function, which executes another program from within a program and requires an explicit command line as its argument (see ch. Q, “The hAWK function”).
The command line passed to hAWK from the setup dialog takes the general form

    hAWK -fProgramName {-fLibraryName} {-vVariable=value} -- {InputFileName}

where the {} brackets indicate that an item may be repeated or omitted. For example, if running a program "$BigSort" with supporting library "Sort_Routines", with the files to be sorted being "Text1" and "Text2" then the command line passed to hAWK by the setup dialog will be something like

    hAWK -f$BigSort -fSort_Routines -- HardDrive:Code Folder:Sub Folder:Text1 HardDrive:Code Folder:Sub Folder:Text2

The "-f", "-v", and "--" are little markers that hAWK uses to tell what's what. "-f" means a program file, "-v" means a variable assignment, and "--" means that only input files (if anything) follow this marker.

By the time the command line becomes available to you within your hAWK program, the array "argv" is a hAWK array of strings called "ARGV" that contains only "hAWK" in ARGV[0] followed by the names of the input files in ARGV[1], ARGV[2] etc, and ARGC is set to the number of elements in the ARGV array, namely the number of input files plus one. The last input file name is ARGV[ARGC-1].

Normally, the input file names are the only things on the command line of interest that you don't already have access to. You'll have access to the variables anyway, and one can't help thinking that it would be an odd program indeed that needed to know its own "ProgramName". Here's a hAWK program that prints a complete list of the input file names passed to it ($EchoFullPathNames again):

    BEGIN {
        for (i = 1; i < ARGC; ++i)#note ARGV[0] is just "hAWK"
            print ARGV[i]
        }

If you included this block in $BigSort above, then the output would be something like

    HardDrive:Code Folder:Sub Folder:Text1
    HardDrive:Code Folder:Sub Folder:Text2

As you can see, you're getting the full path names of the files, not just the file names.
Here's a version that prints just the file names proper:

    BEGIN {
        for (i = 1; i < ARGC; ++i)
            {
            n = split(ARGV[i], names, ":")
            print names[n]
            }
        }

for which the output would for example be

    Text1
    Text2

The important thing to note here is that hAWK deals with full path names for files, especially relevant if you are redirecting input or output (more on this later).

When you assign values to variables using the "Set variables" button in the setup dialog, the result is the same as if you assigned the value in the BEGIN block of your program. However, you should NOT use quotes if you are assigning a text string to a variable using "Set variables". For example, the variable assignment
    find=text to find
within the "Set variables" dialog is equivalent to the statement
    BEGIN {find = "text to find"}
within your actual program. This is meant to be a convenience, but is perhaps a nuisance, in that any spaces between the '=' and the value are significant:
    find =text
is not the same as
    find = text
because that space between the '=' and the 't' of "text" will be included in the string for "find".

The "Set variables" button can be used to set the value of any hAWK variable, whether your own or a predefined (built-in) variable, and it is easier to change a variable this way than to edit the program itself. Up to 10 variables can be set with "Set variables", and your variable settings will be saved for the next run if you click the "Save settings" button in the setup dialog. For an illustration and more details, see the “Running hAWK programs” chapter.

----------------
Variables and constants
----------------

Variable names and types

hAWK has many built–in variables, and you can use your own. A variable of your own devising springs into existence when you first use it, with no need to declare it (excepting perhaps local variables for functions, which need to be not so much declared as “mentioned”; see “Local variables in functions” below).
Variable names in hAWK take the same form as C names: a letter or underscore followed by any number of letters, underscores, and numbers. hAWK has both scalar variables and one–dimensional arrays. The value of a variable or array element may be a (floating–point) number OR a string, and the specific type at any time depends on how you use the variable. While numeric values in hAWK are nominally floating–point, if you consistently use a variable as an integer you will get predictable results. For example, for (i = 0; i <= 1; ++i) print i will print two values, 0 and 1, guaranteed. Uninitialized variables have the numeric value 0 and the string value "" (the null, or empty, string). Note this differs from a variable that has been explicitly initialized to zero, for in this case while the numeric value will be zero the string value will be "0". Constants Constants can be integers, floating–point numbers, or strings. For example, x = "A string of text"; y = 7; z = .31415926E1; pat = "[_A-Za-z][_A-Za-z0-9]*"; (a string to be interpreted as a regular expression - it matches a hAWK variable name). Record and field variables After the BEGIN block(s) of a program have been executed, a hAWK program proceeds to automatically retrieve records from your input files one at a time to the built–in variable $0, and individual fields in the current record can be accessed with the built–in variables $1, the first field, $2 etc up to $NF, the last field, where NF is a built–in that records the current number of fields. Records are separated according to the string contained in the built–in record–separator variable, RS. By default this contains just a return, ie RS = "\n", so a record is the same as a line. You can change the value of RS, and setting RS to ""(the null string) will cause empty lines to be treated as the record separator. Note that the record separator itself is trimmed from the record. 
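The rule for uninitialized variables can be tried out directly. The sketch below is only an illustration (the names "fresh" and "zero" are invented) and relies on the standard awk rule that a comparison against a string constant is done as a string comparison.

```awk
BEGIN {
    # "fresh" has never been assigned, so it is both 0 and "".
    freshIsZero  = (fresh == 0)     # 1
    freshIsEmpty = (fresh == "")    # 1

    # "zero" was explicitly set to 0: its string value is "0", not "".
    zero = 0
    zeroIsEmpty  = (zero == "")     # 0

    print freshIsZero, freshIsEmpty, zeroIsEmpty
}
```

This is why a test like (x == "") can tell "never set" apart from "set to zero".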
Similarly, fields are separated in accordance with the value of the field–separator variable, FS. By default the field separator is a regular expression standing for “one or more blanks or tabs”, and as a nicety if you use the default value of FS then any leading blanks or tabs will be trimmed away from the first field, $1. References to non-existent fields (fields after $NF ), produce the null-string. However, assigning to a non-existent field (e.g., $(NF+2) = 5 ) will increase the value of NF , create any intervening fields with the null string as their value, and cause the value of $0 to be recomputed, with the fields being separated by the value of OFS, the output field separator. A negative field number is an error. Many functions in hAWK allow you to optionally specify a string for them to work on, and if you don’t specify a string then it uses $0, the current input record, by default. For example, print "some text" will do just that—print the string "some text" to the standard output, whereas print all by itself, will print the contents of $0 to stdout, and thus it has the same effect as print $0 Note that “print” tags on the contents of ORS, by default a return, to its output, so in the default case the return that was trimmed away when retrieving the current input record is added back. Thus, the hAWK program that consists of the one line {print} will echo all of its input to stdout (the file $tempStdOut) without change, though a flurry of activity involving returns takes place behind the scenes. This little program prints the individual fields of each input record to individual lines: {for (i = 1; i <= NF; ++i) print $i } —note that the field specifier can be a variable as in “$i”, and doesn’t have to be a constant. Built–in variables hAWK's built-in variables are: ARGC the number of input files plus one ARGV array of command line arguments. The array is indexed from 0 to ARGC - 1, the input file names being ARGV[1] through ARGV[ARGC-1]. 
Dynamically changing the contents of ARGV can control the files used for data. FILENAME the name of the current input file. If no files are specified on the command line, the value of FILENAME is "-". A hAWK program may do all of its work in a BEGIN block, with no need for input (generating a list of random numbers for example). FNR the input record number in the current input file. Reset to 1 when starting a new input file. Hence the pattern “FNR == 1” detects the start of each file. FS the input field separator, a blank by default. If the default FS is used then leading blanks and tabs are trimmed from $1. IGNORECASE controls the case-sensitivity of all regular expression operations. If IGNORECASE has a non-zero value, then pattern matching in rules, field splitting with FS , regular expression matching with ~ and !~ , and the gsub() , index() , match() , split() , and sub() pre-defined functions will all ignore case when doing regular expression operations. Thus, if IGNORECASE is not equal to zero, /aB/ matches all of the strings "ab", "aB", "Ab", and "AB". The initial value of IGNORECASE is zero, so all regular expression operations are normally case-sensitive. NF the number of fields in the current input record. NR the total number of input records in all input files seen so far. OFMT the output format for numbers, %.6g by default. OFS the output field separator, a blank by default. ORS the output record separator, by default a newline. RS the input record separator, by default a newline. RS is exceptional in that only the first character of its string value is used for separating records. If RS is set to the null string, then records are separated by blank lines. When RS is set to the null string, then the newline character always acts as a field separator, in addition to whatever value FS may have. RSTART the index of the first character matched by match(); 0 if no match. RLENGTH the length of the string matched by match(); -1 if no match. 
SUBSEP    the character used to separate multiple subscripts in array elements, by default "\034", some kinda up arrow very rare in text.

(and three added for the Macintosh version)

RUNERR    short for "run error", a file name that you can use to print your own error messages to, as in

    print "Error during run" > RUNERR

Default name is $tempRunErr, and you'll find the file in the same folder as $tempStdOut.

STDPATH    path name that can be prefixed to any file name you wish to be written to the same folder as stdout ($tempStdOut). Typically looks like "Disk:folder1...:THINK C folder:" and typical use looks like

    outname = "MyOutFile"
    fullOutName = STDPATH outname;
    print "something" > fullOutName;

TIME    at start of run, eg "Sunday, October 13, 1991 07:58 AM"

Local variables in functions

Function definitions in hAWK resemble those of C a bit, but local variables require an odd syntax. They must be listed in the parameters of the function, after the real parameters, in order to be treated as local. All other variables in hAWK have global scope. For example, in

    function SumArray(arr, i, sum)
    {
        for (i in arr)
            sum += arr[i];
        return sum
    }

the only real parameter is the array name “arr”. This function sums up the contents of the array and returns the sum, used as in “sum = SumArray(x);” where x is an array containing numbers. The variables “i” and “sum” look like orphans there in the parameters, but this is just the hAWK way of declaring local variables. (The loop variable is not called “index” because that is the name of a built–in string function.) Neither i nor sum can be affected by any statements outside the SumArray function (that is, they are local in scope), and as a bonus hAWK initializes even local variables to 0 each time the function is called.

Functions are described in more detail a little later in the chapter “User-defined functions”.
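As a rough check of the local-variable rule, the sketch below calls a SumArray function twice while a global variable that happens to share the name "sum" is in play. The extra spaces before the locals in the parameter list are only a common awk convention for telling them apart from real parameters; they have no meaning to the interpreter. The sample array contents are invented.

```awk
# Sum the elements of arr. "i" and "sum" are locals: they are listed after
# the real parameter and the caller never passes them.
function SumArray(arr,    i, sum) {
    for (i in arr)
        sum += arr[i]
    return sum
}

BEGIN {
    sum = "untouched"          # a global that shares a local's name
    x[1] = 10; x[2] = 20; x[3] = 12

    first  = SumArray(x)       # 42
    second = SumArray(x)       # 42 again: the local sum restarts from 0

    print first, second, sum   # the global sum is still "untouched"
}
```

If "sum" were left out of the parameter list it would be a global, and the second call would carry on from the first call's total.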
Setting variables on the command line When variables are set using the “Set variables” option in the setup dialog, no quotes should be used around strings, and no space should be put between the equals sign and the string or number unless you want it to be included in the value. For example, the equivalent of BEGIN {find = "some text to find"; first = 7;} in the “Set variables” dialog would be find =some text to find first =7 (the space before the equals sign is optional). Conversion between numbers and strings Conversion of a variable’s value between number and string is automatic in hAWK when circumstances call for it, and can be forced by you as well. When an operator is strictly numeric, the value of its operands will be forced to numbers if necessary, and similarly if an operator expects to deal strictly with strings then values will be forced to strings. For example, in a = "102"; b = a + 1; “a” starts out as a string, but the “+” operator deals strictly with numbers, so “a” is converted to the number 102.0 on the second line. And in a = 27; b = "trombones"; c = a b; #there is a space between a and b we see the invisible “concatenation” operator at work. Two variables or constants separated by just a space are treated as strings by hAWK and concatenated together. So “a” is converted to a string on the third line, and “c” ends up holding the string "27trombones". Some operators (all of the comparison operators == <= >= etc for example) can accept either strings or numbers. When this is the case, the rule is that the operation proceeds numerically if both operands are currently valid numbers, but proceeds as a string operation otherwise. You can force a variable to be treated as a string by concatenating the null string to it. For example, no matter what the values of a and b are, the comparison a "" == b will proceed as a string comparison. 
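The conversions described in this section can be seen side by side in a small sketch. The values are invented, and the last two lines assume the usual awk behavior that concatenating the null string yields a string operand.

```awk
BEGIN {
    a = "102"
    b = a + 1                   # "+" is numeric, so a is used as 102; b is 103

    n = 27
    c = n "trombones"           # concatenation, so n is used as "27"

    x = 10; y = 9
    asNumbers = (x < y)         # 0: 10 is not less than 9
    asStrings = (x "" < y "")   # 1: "10" sorts before "9", character by character

    print b, c, asNumbers, asStrings
}
```

The last comparison shows why it matters which kind of comparison you get: the same two values order differently as numbers and as strings.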
And you can force a variable to be treated as a number by adding 0 to it, as in

    a + 0 == b + 0

but note in this case that both operands should be forced to numeric type.

Arrays

Arrays are subscripted with an expression between square brackets, arr[expr]. Array values can be numbers or strings, but the index is always interpreted as a string. For example, when you write arr[1] the 1 is converted to the string "1" for use as the array index, so arr[1] is the same as arr["1"]. This sort of array is called “associative” since it can associate one string of text with any other, eg

    arr["John Henry"] = "was a log-drivin man"

If the index expression is an expression list (expr1, expr2, expr3, ...) then the array subscript is a string consisting of the concatenation of the (string) value of each expression, separated by the value of the SUBSEP variable, which is by default “\034” (decimal 28, an up arrow). This facility is used to simulate multiply–dimensioned arrays. For example:

    i = "A"; j = "B"; k = "C"
    x[i, j, k] = "hello, world"

assigns the string "hello, world" to the element of the array x which is indexed by the string "A\034B\034C".

The special operator "in" may be used in an "if" statement to see if an array has an index consisting of a particular value:

    if (val in array) print array[val]

If the array has multiple subscripts i j k, use

    if ((i, j, k) in array)

instead. The alternate

    if (array[val] != "")

actually creates the element array[val] if it does not exist, so using “in” is usually better.

The "in" construct may also be used in a for loop to iterate over all the elements of an array:

    for (i in arr) delete arr[i]    # or print arr[i], or print i, arr[i]

An element may be deleted from an array using the delete statement. New elements should not be added to an array while looping over it with the "in" for-loop, since hAWK isn’t quite smart enough to handle that very well.
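The difference between testing with "in" and testing by reference is easy to miss, so here is a small sketch that counts the elements of an array before and after each kind of test. It also shows that a subscript list and an explicit SUBSEP join name the same element. The array names and contents are invented for the illustration.

```awk
BEGIN {
    arr["a"] = 1

    if ("b" in arr) hit = 1            # only asks; arr["b"] is not created
    before = 0
    for (k in arr) ++before            # 1

    if (arr["c"] != "") hit = 1        # the reference itself creates arr["c"]
    after = 0
    for (k in arr) ++after             # 2

    x["A", "B"] = "hello"              # stored under "A" SUBSEP "B"
    viaList = (("A", "B") in x)        # 1
    viaJoin = (("A" SUBSEP "B") in x)  # 1, the same element

    print before, after, viaList, viaJoin
}
```

In a long run over many input records, stray elements created by reference tests can add up, which is another reason to prefer "in".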
Behind the scenes, indexes for an array are stored in a hash table. Retrieval of an array element takes constant time up to a moderate array size (~1000), but as array size increases retrieval time will increase as a linear function of the size.

Some array examples:

    for (i = 1; i <= 100; ++i)
        x[i] = i;

This does what you would expect, creating x[1] = 1, ... x[100] = 100. Note, however, that while i is treated as an integer in the for loop, it is converted to the string representation for that number when used as the index for x.

    for (i = 1; i <= NF; ++i)
        wordCounter[$i] += 1;

Here we see the real power of hAWK’s associative arrays. $i is a string containing a field on the current input line, and this string is used as an index into the wordCounter array. If there is no element in the array yet for the index, a new element is created (and initialized to 0/the null string, as for regular variables). The array element itself holds just a count of how many times the string has been seen. Obviously, you can’t access these array elements by incrementing a numeric index. Here’s where “in” comes in:

    for (word in wordCounter)
        print word, "was seen", wordCounter[word], "times."

prints out the words used to index wordCounter, together with the word counts, a sample line being

    parsimonious was seen 1 times.

The one drawback of this simple example is that the words will be printed in a rather arbitrary order (internally, the entries in a hash table are being accessed). However, even this shortcoming can be overcome. The sample program “$WordFrequency” shows how to sort an array such as wordCounter into dictionary order on the index.

    while (getline x > 0)
        lines[++n] = x;

The “getline x” will retrieve records from your current input file to the variable x, from the current position to the end of the file. Each record is saved away as an element in the array “lines”.
Here the index is a number (technically the string for the number) and the element is a string, the reverse of the last example.

    times[3,7] = 21;

The actual index is "3" "\034" "7" concatenated together. A multi-dimensional array can be run through in the same way as in C:

    for (i = 1; i <= iMax; ++i)
        {
        for (j = 1; j <= jMax; ++j)
            {
            print times[i,j]    # or whatever
            }
        }
    # note "for (k in times) print times[k]" could also be used.

-------
Patterns
-------

Patterns and actions

At the top level, a hAWK program consists of patterns and actions, of the general form

    pattern { action }

When a pattern evaluates to true (non–zero), the corresponding action is taken. Patterns resemble the conditions found in a C if-statement, but several kinds of patterns, notably BEGIN, END and patterns using the matching operator '~', are not found in C.

As described earlier, hAWK will automatically read in your input one record at a time to the variable $0, and each pattern is evaluated in turn; if the pattern is true for the current input, then the action statements are executed. A missing pattern evaluates to true, so action statements with no preceding pattern are executed for every input record. A missing action is equivalent to

    { print }

which prints the input record to stdout. It’s equivalent to {print $0}, by the way.

Here’s a sample pattern-action block that is often useful:

    FNR == 1 {
        z = split(FILENAME, names, ":")
    }

FNR stands for "file number of records", reset to 1 at the beginning of each input file. FILENAME is a variable holding the full path name of the current input file. The split on ':' splits FILENAME into an array, treating the ':' as the element separator. Often, one wants just the file name proper without the disk and folders, and this is given by names[z]. For example, if FILENAME = "Disk:folder:thefile" then the split produces names[1] = "Disk", names[2] = "folder", and names[3] = "thefile", with "z" being set to 3.
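The path-splitting idiom can be tried without any input files by feeding split() a literal path, as in the sketch below. BaseName is a hypothetical helper name invented for this illustration (it is not a hAWK built-in), and its "parts" and "n" are locals declared the hAWK way.

```awk
# BaseName is a made-up helper: it returns the last ':'-separated piece
# of a Macintosh path name.
function BaseName(path,    parts, n) {
    n = split(path, parts, ":")
    return parts[n]
}

BEGIN {
    z = split("Disk:folder:thefile", names, ":")
    print z, names[1], names[2], names[3]

    leaf = BaseName("HardDrive:Code Folder:Sub Folder:Text1")
    print leaf                 # blanks inside folder names are left alone
}
```

In a real program the call would more likely be BaseName(FILENAME) inside an FNR == 1 block.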
The statement "print names[z], FNR" will print the current input file name and current line number to stdout.

The “Summary of patterns” section at the end of this chapter contains a small program that will let you try out patterns as they occur to you. Or you could use $RunClip.

BEGIN and END

BEGIN and END are two special kinds of patterns which are not tested against the input. The action parts of all BEGIN patterns are merged as if all the statements had been written in a single BEGIN block. They are executed before any of the input is read. Similarly, all the END blocks are merged, and executed when all the input is exhausted (or when an exit statement is executed). BEGIN and END patterns cannot be combined with other patterns in pattern expressions. BEGIN and END patterns cannot have missing action parts.

    BEGIN {FS = ",[ \t]*|[ \t]+"}

sets the field separator to either a comma followed by optional blanks and tabs or one or more blanks and tabs, a common field separator in a real database.

END blocks are often used to finish up after all the input has been seen, as in this little program:

    {out[++n] = $0}
    END {for (i = n; i >= 1; --i) print out[i]}

which accumulates all input records in the array “out”, and then at the end prints out the records in reverse order.

Expressions as patterns

Simply put, an expression is any sensible combination of variables, operators, and (rarely) function calls. When an expression used as a pattern evaluates to a non–zero or non–null result, the action following it will be carried out.

The most common sort of expression used as a pattern is the comparison, involving the operators ==, <=, >=, >, <, and !=. These can be used with any hAWK variable or calculated result, and it is a refreshing improvement over C to be able to test two strings for equality with the simple “a == b” instead of “!strcmp(a,b)”.
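To get a feel for the comma-or-blanks separator from the BEGIN example, the sketch below hands the same regular expression to split() and applies it to one made-up record, then walks the pieces backwards in the manner of the END block that reverses its input. It runs entirely in a BEGIN block, so it needs no input file; the record text and variable names are invented.

```awk
BEGIN {
    record = "alpha, beta  gamma,delta"

    # Same separator as the FS example: a comma plus optional blanks/tabs,
    # or else a run of blanks/tabs.
    n = split(record, field, ",[ \t]*|[ \t]+")

    # Collect the pieces last to first.
    reversed = ""
    for (i = n; i >= 1; --i)
        reversed = reversed field[i] (i > 1 ? " " : "")

    print n, reversed
}
```

Setting FS to the same string would make $1 through $4 hold the same four pieces for an input line with that text.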
Comparison patterns quite often involve tests on the current input, such as “$1/$2 >= 100”, “$3 == "Wilhelmina"”, “$0 != ""”, the last testing that the current input line is not empty. Built–in variables are also popular, as in the “FNR == 1” example a few paragraphs above, which detects the start of an input file. Your own variables can of course appear, as in $1 != lastFieldOne { print "New field one is", $1 lastFieldOne = $1 } which prints the contents of the first field on the input line whenever it changes. In a comparison, if both sides are numeric then the comparison is made numerically, but if one side evaluates to a string then the comparison is done in terms of strings, with the other side first being converted if necessary to a string. String-matching patterns The matching operator, denoted by a tilde (~), allows you to detect whether one string contains another string, though technically that other string is treated as a “regular expression”. More on regular expressions in just a minute, but for now you can form a regular expression to look for from a string of characters by putting a forward slash before and after them. For example, if you wish to determine if the current input line contains the string "exception", then the pattern $0 ~ /exception/ will do it. Note that it could match the line "while this is not an exceptional case, there are other" that is, the match does not have to be an entire word. By default if you omit the string for the matching operator to check against, and further omit even the matching operator, leaving just the regular expression enclosed in slashes, then the match will be done against the current input line $0. 
In other words,

    /regular expression/ {action}

is the same as

    $0 ~ /regular expression/ {action}

and since even the action is optional (recall the default is to print $0), about the shortest hAWK program you can write is

    /a/    #equivalent to $0 ~ /a/ { print $0 }

which will print any input line containing an “a” to stdout.

To match punctuation explicitly in your expression you should precede it with a backslash, eg /question\?/, /the end of the sentence\./, /array\[index\]/.

You can use quotes instead of the forward slashes to surround the text of your regular expression with the same results. In this case, though, the matching operator must explicitly appear. Eg

    $0 ~ "Mars" {print "red planet detected on input line", FNR}

And to match punctuation explicitly inside the quotes, you should precede the punctuation with two (that’s right, two) backslashes. For example, to match "the end." use

    string ~ "the end\\."

Using forward slashes instead of quotes around your regular expression has three small advantages: matching against $0 doesn’t need to be fully written out, only single escapes are necessary to match punctuation, and after a while the forward slashes will stand out as you read your programs, signalling a matcher.

The negation of the matching operator, “!~”, allows you to determine if a string does not contain some regular expression, as in

    $2 !~ /A/ {print "Error, second field does not contain the letter A"}

and any points mentioned above for ~ apply to !~.

Regular expressions

Regular expressions aren’t as hard to use as a first impression suggests, and if you try out a dozen you’ll be hooked, guaranteed. In regular expressions certain characters have special “powers” that allow you to search for entire related groups of strings with a single specifying string.
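The one-backslash versus two-backslash rule for punctuation is easiest to believe after seeing it work. The sketch below matches two made-up strings against the slash form, the quoted form, and a quoted form with the escape forgotten. It sticks to features found in standard awk, on the assumption that hAWK handles them the same way.

```awk
BEGIN {
    good = "this is the end."
    bad  = "the endless summer"

    slashGood  = (good ~ /the end\./)     # 1: a literal period is found
    slashBad   = (bad  ~ /the end\./)     # 0: "l" is not a period
    quoteGood  = (good ~ "the end\\.")    # 1: two backslashes inside quotes
    quoteLoose = (bad  ~ "the end.")      # 1: an unescaped "." matches the "l"
    negated    = (bad !~ /the end\./)     # 1: !~ is the opposite of ~

    print slashGood, slashBad, quoteGood, quoteLoose, negated
}
```

The fourth result is the one to watch for: leaving out the escape does not cause an error, it just quietly matches more than intended.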
Consider that an ordinary “find” command will not let you completely match the following variations of a string: plurals; possessives; variable blanks, tabs and especially returns between the words of a string; one or more alternate words in the string; the complete word that contains some special substring; two or more complete strings at once (one or the other). A regular expression is nothing more than a string of text with optional special “metacharacters”, and in most cases the string to be used can result from the evaluation of a variable, or the concatenation of several strings or variables. This means you can build the regular expressions for your program during the execution of your program, modifying them on the fly to suit changing circumstances. Parts of a regular expression can be grouped (with ordinary parentheses), and later in the regular expression or in a replacement string can be referred to by the group “tags” \1, \2, ... \9 where \1 refers to the group started by the first left parenthesis, \2 to the second, etc. These allow you to match a small pattern within the context of a larger one, detect duplicate expressions, change the order of the groups and so on. Note that parentheses have the highest precedence of all regular expression “operators”, so they serve two purposes; changing the order in which the metacharacters apply, and marking the boundaries of a group, for later reference via \1..\9. More on this in a bit. Regular expressions are built from ordinary characters, the escape sequences \t \n \b \B \w \W \< \> \1 \2 \3 \4 \5 \6 \7 \8 \9 and from the metacharacters \ ^ $ . [ ] | ( ) * + ? which are the ones with the special powers mentioned above. As you saw in the above section, if a regular expression contains no metacharacters then it behaves like an ordinary “find” string in that each character in the regular expression must match a character in the string being searched. 
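Since a regular expression can come from a variable or a concatenation, a program can assemble its patterns as it goes. The sketch below builds two patterns from pieces and tries them on a few sample strings. The variable names and sample words are invented, and only metacharacters from the list above are used.

```awk
BEGIN {
    word   = "house"
    plural = "^" word "s?$"              # becomes  ^houses?$

    colors     = "blue|green"
    wholeColor = "^(" colors ")$"        # becomes  ^(blue|green)$

    r1 = ("house"    ~ plural)           # 1
    r2 = ("houses"   ~ plural)           # 1
    r3 = ("housed"   ~ plural)           # 0
    r4 = ("green"    ~ wholeColor)       # 1
    r5 = ("greenish" ~ wholeColor)       # 0

    print r1, r2, r3, r4, r5
}
```

The same approach works with a value supplied through "Set variables", so the pattern to search for need not be written into the program at all.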
The following table summarizes all character usage in a regular expression (where a b c are ordinary characters, m is a metacharacter, r is a regular expression, and d is a digit): c matches the non-metacharacter c itself \m matches the literal character m, eg \$ matches the dollar sign. . matches any single character except newline. ^ matches the beginning of a line or a string. $ matches the end of a line or a string. [ abc... ] character class, matches any one of the characters a or b or c etc... . [^ abc... ] negated character class, matches any character except abc... and newline. (Ranges of characters may be abbreviated in character classes, as in [0-9] which matches any digit, [A-Za-z] which matches any letter, [^0-9] which matches anything but a digit). \w matches a “word” character, exactly equivalent to [0-9A-Za-z] \W matches a non-word character, ie [^0-9A-Za-z] \< matches the beginning of a word. \> matches the end of a word. \b matches the beginning or end of a word (a word boundary). \B matches the boundary (beginning or end) of a set of non-word characters. \t matches a tab. \n matches a newline (the Return key). r1 | r2 alternation: matches either r1 or r2, eg "blue|green" r1r2 concatenation: matches r1 followed by r2 . r + matches one or more r 's. r * matches zero or more r 's. (Note that zero r’s can be anywhere in the text) r ? matches zero or one r 's. ( r ) grouping: matches r. Parentheses have two distinct uses; to override default precedence of metacharacter operators, and to tag a subexpression for subsequent reference. \1...\9 stand for whatever text the first through ninth set of parentheses currently match, counting opening parentheses from left to right. Note that if the pair of parentheses has a + or * or ? operator after it, then all of the matches are included, eg /(foo)+bar/ applied to "foofoofoobar" will set \1 to "foofoofoo". To get just the first foo, use /(foo)\1*bar/ - then \1 is set to "foo". 
(Perl users note this is the opposite of what you are used to). \ddd is interpreted as an octal number, as in C. The digits exclude 8 and 9, needless to say, and there can be from 1 to 3 digits in the number. Note that \1 through \7 are interpreted as subexpression tags unless followed immediately by another octal digit (eg \23 is not tag 2 followed by a 3, it is the octal number 19 decimal). \8 and \9 are always tags, since 8 and 9 are not octal numbers. To refer to octal numbers 1 to 7, use \01 to \07. To follow a tag with a low number (eg \2 followed by 3), use the octal representation of the number (eg \2\063 -- \063 equals 51 decimal, the ASCII code for 3). The metacharacters ^ and $ to match the beginning and end of strings, and \b \B \< \> to match various boundaries don’t actually match any characters; rather they force alignment to a particular text position. For example, /\brun\b/ will always match just “run” if it matches anything, but will not match "runner" or "brunt". By comparison, /\Wrun\W/ won’t match “runner” or “brunt” either, but it will include any non–word character that happens to come before or after the word “run”. Normally you won’t want to include leading or trailing spaces etc in the match. Parentheses () have the highest precedence, allowing you to override default precedence when needed. The “repetition” operators * + ? have the next–highest precedence, followed by concatenation, with alternation having the lowest precedence of all. For example, in abc*d the * applies only to the c since the repetition operator acts before concatenation, and in abd|def the | applies to abd and def since concatenation binds them together into little groups of three before alternation can play. Regular expression can be used to just locate an instance of a pattern, as in $0 ~ /extern/ but they can also be used to specify text for replacement, by using the “sub” and “gsub” functions. 
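The precedence rules can be confirmed with a few anchored matches. This is only a sketch with made-up test strings; the ^ and $ anchors are added so that each expression has to account for the whole string.

```awk
BEGIN {
    # abc*d : the * applies to the c alone
    p1 = ("abd"     ~ /^abc*d$/)        # 1: zero c's
    p2 = ("abcccd"  ~ /^abc*d$/)        # 1: three c's
    p3 = ("abcabcd" ~ /^abc*d$/)        # 0: "abc" as a unit is not repeated

    # (abc)*d : parentheses make the whole group repeat
    p4 = ("abcabcd" ~ /^(abc)*d$/)      # 1

    # abd|def : concatenation binds before alternation
    p5 = ("def"   ~ /^(abd|def)$/)      # 1
    p6 = ("abdef" ~ /^(abd|def)$/)      # 0: the choice is abd or def, nothing else
    p7 = ("abdef" ~ /^ab(d|d)ef$/)      # 1: the grouping that would match it

    print p1, p2, p3, p4, p5, p6, p7
}
```

Note that the anchors themselves are subject to the same rule: without the parentheses, ^abd|def$ would mean "abd at the start, or def at the end".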
Looking ahead just a bit, these functions take a regular expression as the first argument, the string to use for replacement as the second argument, and the string to do the search and replace in as the third argument, with $0 used by default if there is no third argument. “sub” does a single substitution on the text, and “gsub” does all possible non-overlapping substitutions. Within the replacement strings of these functions, you can use \1 through \9 to refer to text currently matched by tagged subexpressions, and the ampersand “&” stands for all of the text that was matched. To put a plain ampersand in the replacement, use “\&”.

At this point some considerable exampling usually helps:

    The quick brown
matches just that, "The quick brown". Note it would match "The quick brown" in "The quick brownie".

    red fox\.
matches "red fox." (the period must be escaped for a literal match).

    [ \t]
matches a single space or tab (that’s a space before the \).

    [ \t]+
matches any consecutive run of spaces and tabs in any mix.

    [0-9]+
matches an integer (read “one or more digits”).

    [+-]?[0-9]+
matches an integer, together with optional preceding sign.

    \<[A-Za-z'’-]+\>
matches an English word.

    houses?
matches "house" or "houses".

    m(iss)*ippi
matches "mippi", "missippi", "mississippi", "missississippi", etc.

    ar*g
matches "ag", "arg", "arrg", "arrrg", etc.

    MyFunction\(
matches "MyFunction(".

    array\[index\]
matches "array[index]".

    array\[.+\]
matches "array[i]", "array[j]", "array[2*q-1]", etc.

    \\([0-7]|[0-7][0-7])
matches "\d" or "\dd" where d is an octal digit.

    ([^\\]?|(\\\$+)"
(horrors, be brave) matches an unescaped quote or a quote preceded by an even number of backslashes, in other words a true quote in C. The backslash is a metacharacter, so matching one literally requires a backslash before the backslash.

    The[ \t]+quick[ \t]+brown
matches "The quick brown" with variable spaces and tabs between the words.

    \/\*
matches the start of a C comment, "/*".
The forward slash is escaped so that you can place the whole regular expression inside forward slashes. The escape before '/' would not be needed if you placed the expression inside quotes, but then you would need two escapes before the '*', ie "/\\*". \/\*.*\*\/ matches all of a one–line C comment, "/* - anything - */". ^Z matches a 'Z' at the beginning of a string. ^. matches the first character of a string. .$ matches the last character of a string. ^.*$ matches any string completely (not much use). ^A..$ matches any string which is three characters long, the first being an 'A'. ^(A|B).* matches all of any string that begins with 'A' or 'B'. ^[AB].* does likewise. (\w|_)\w* matches a C term, or integer constant. ((->)|(\.))(mem\b) matches “mem” when it is immediately preceded by “->” or “.”, and is not the beginning of a longer word. For replacement purposes in a “sub” or “gsub”, the part before “mem” is given by \1, and mem itself is \4. gsub(/((->)|(\.))(mem\b)/, "\1\4ber") will turn “->mem” into “->member” and “.mem” into “.member” everywhere in the current input line $0, ignoring things like “remember” or “->memories”. gsub(/\bFuncName([ \t]*\()/, "FunctionName\1") will replace “FuncName” by “FunctionName” everywhere in the current input line $0, provided it is followed on the same line by an opening parenthesis, with optional spaces or tabs between the name and “(”. The match extends from the “F” of “FuncName” up to and including the “(”, so the “(” and any intervening white space must be put back into the replacement string by tagging them in parentheses and using \1 after “FuncName” to refer to what was matched by the first set of parentheses in the pattern. This program prints all input lines containing one-line comments: /\/\*.*\*\// {print} (since {print} is the default action, it could be left out). Within a character class most metacharacters are taken literally. 
The exceptions are the escaping backslash \, the negating ^ (only at the beginning), and the range hyphen - (only between two characters). For example, [A-Za-z-] matches an English word, hyphens included [-A-Za-z] does the same [\-A-Za-z] also does the same (the '\' is unnecessary but harmless) ^[^^] matches any single character that is not a '^' at the beginning of a string [\^] matches a '^'. The toughest metacharacter to remember is the '^' which has three meanings: at the beginning of a character class it signals a negated character class; outside of a character class it matches the beginning of a string; and when escaped or not the first character in a character class it matches a literal '^'. Regular expressions are “left greedy”; where there could be more than one match in a string, a regular expression matches the leftmost one, and extends the match as far as possible. For the implications of this, see the discussion of the “match” operator in the “Built–in string and file functions” section of the next chapter, “Actions”. Now that we’re starting to get the hang of things, more examples using the replacement functions “sub” and “gsub” mentioned above. The format is sub(r,s,t) where r is a regular expression, s is the replacement string, and t is the string in which the search and replace is to be done. The contents of t before and after the sub are spelled out below. using t = "Don’t run that prune over, runt!": sub(/run/, "fly", t) turns t into "Don’t fly that prune over, runt!" gsub(/run/, "fly", t) turns t into "Don’t fly that pflye over, flyt!" gsub(/\brun\b/, "fly", t) turns t into "Don’t fly that prune over, runt!" gsub(/run/, "t&k", t) turns t into "Don’t trunk that ptrunke over, trunkt!" using t = "#define FOO 1": sub(/#define\W+(\w+)\W+([0-9]+)/, "int \1 = \2;",t) turns t into "int FOO = 1;" (\W+ means one or more non-word characters, \w+ means one or more word characters, [0-9]+ means one or more digits; two groups are tagged). 
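The plain and ampersand forms of replacement can be checked with the sketch below, which works on a copy of one sample sentence for each call. It uses only features shared with standard awk (so no \1 tags or \b), and it relies on the standard awk behavior that sub and gsub return the number of replacements made. Inside a quoted replacement string, the escaped ampersand is written with two backslashes so that one backslash survives to reach the function.

```awk
BEGIN {
    src = "Do not run that prune over, runt!"

    t1 = src; n1 = sub(/run/, "fly", t1)     # first match only
    t2 = src; n2 = gsub(/run/, "fly", t2)    # every match
    t3 = src; n3 = gsub(/run/, "t&k", t3)    # & stands for the matched text
    t4 = src; n4 = gsub(/run/, "\\&", t4)    # an escaped & is a plain ampersand

    print n1, t1
    print n2, t2
    print n3, t3
    print n4, t4
}
```

Copying src into a scratch variable first matters, because both functions change their third argument in place.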
Three programs are supplied to help you do general–purpose listing of matches or search–and–replace; $MFSLister searches for either plain text or a regular expression with “Set variables” in the setup dialog, and lists file name/ line number of all single–line matches to stdout; $MFS_SuperLister does much the same, but finds matches that span a variable number of lines; and $MFS_SuperReplace does the ultimate search and replace, matching either plain text or full–blown regular expressions over a variable number of lines, handling any number of files at once, documenting the (post–change) locations of all changes to stdout. Heck, it even prints the fragments of original text before the changes, so that if you mess up you can at least (manually) undo the damage. (Exercise: write $MFS_Undo_SuperReplace). Compound patterns The logical operators ||, &&, and ! can be used to combine simple patterns into compound ones. These operators function the same as in C, specifically: || is the inclusive–or operator; && is the and operator; and ! is negation, with evaluation of a compound pattern proceeding only as far as necessary to determine whether the whole pattern is true or false. Some examples: $1 ~ /DATA/ && $2+0 > 0 is true when the first field contains the string "DATA" and the second field is numeric and greater than zero. If the first field does not contain "DATA" then the second field is not checked. $1 == "DATA" || $1 == "INFO" is true when the first field is exactly equal to "DATA" or "INFO". The check for "INFO" is performed only if the check for "DATA" fails. $2 != 0 && !($3/$2 > 10 || $3/$2 < 1) first checks that $2 is not zero, to avoid dividing by zero, and then evaluates to true if $3 divided by $2 falls in the range 1 to 10. The ? : operator can be used to choose between two patterns, and is like the same operator in C. If the first pattern is true then the pattern used for testing is the second pattern, otherwise it is the third. 
Only one of the second and third patterns is evaluated. $2 != 0 ? $3/$2 > 1 : $3 == 0 first checks to see if field 2 is non–zero; if so, the pattern is true if $3/$2 > 1; otherwise, the overall pattern is true if field 3 is also zero. Range patterns Range patterns consist of two patterns separated by a comma. Given pattern1, pattern2 this evaluates to true for the first input line that matches pattern1, and thereafter is true up to and including the first line encountered that contains pattern2. Both patterns may occur on the same line, in which case the range pattern is true for just the one line (and a check for pattern1 begins again on the next line). If the second pattern is never seen, matching continues to the end of all input. Range patterns, as with BEGIN and END, cannot be compounded with other patterns to form more complicated patterns. Note that pattern2 specifies the last line to be matched, for example NR == 1, NR == 2 matches the first and second lines of input. Range patterns are useful only with input that has been well–organised on a line–by–line basis, with clear signals for the start and end of a group of lines. An ideal case would be a file with markers dedicated to indicating the start and end of a group, such as Start 10 11 -23 47 101 96 End Start 19 23 End etc in which case your program could analyze groups with /Start/, /End/ {actions for the group} but in real life the only way you’ll see an input file like this is if you make it yourself. 
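Since a range pattern cannot be combined with other patterns, one possible workaround is to keep track of the range yourself with a flag variable. The sketch below does this inside a function so that it can be tried without any input file; the name groupSums, its arguments, and the sample data are all invented for the illustration, and the data is assumed to hold one item per line.

```awk
# Emulates the range pattern  /Start/, /End/  with an explicit flag and
# sums the numbers found between the markers of each group.
function groupSums(text, sums,    lines, n, i, inGroup, g) {
    n = split(text, lines, "\n")
    g = 0
    for (i = 1; i <= n; ++i) {
        if (!inGroup && lines[i] ~ /Start/) {   # pattern1 opens the range
            inGroup = 1
            sums[++g] = 0
        }
        if (inGroup) {
            if (lines[i] ~ /^-?[0-9]+$/)        # an extra test that a range
                sums[g] += lines[i]             # pattern cannot carry itself
            if (lines[i] ~ /End/)               # pattern2 closes the range,
                inGroup = 0                     # with this line included
        }
    }
    return g                                    # number of groups seen
}

BEGIN {
    data = "Start\n10\n11\n-23\nEnd\nnoise 5\nStart\n19\n23\nEnd"
    count = groupSums(data, sums)
    for (i = 1; i <= count; ++i)
        print "Group", i, "sums to", sums[i]
}
```

With the sample data this reports a sum of -2 for the first group and 42 for the second; the “noise 5” line falls outside any group and is ignored.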
Summary of patterns A list of beasts in the pattern zoo (regex stands for regular expression, pat stands for pattern, str stands for string variable): Pattern Example ---------------- ------------------------------- BEGIN BEGIN blocks are done before all input END END blocks are done after all input /regex/ /Mary( \t)+had/ str ~ /regex/ (or !~) $1 ~ /(\-)?[0-9]+/ str ~ "regex" (or !~) $1 ~ "(\\-)?[0-9]+" relational expression NF > 4 pattern && pattern FNR == 1 && /File title:/ pattern || pattern /Vermont/ || /Maine/ pattern ? pattern : pattern $3 != 0 ? $2 / $3 > 25 : $2 < 0 ( pattern ) - see next line ! pattern !($0 == "" || $0 ~/^The end$/) pattern1 , pattern2 FNR == 5, FNR == 8 There’s no substitute for doing it yourself. Here’s a small program that will let you try out your own patterns—it’s not saved separately, so select it and save it into your “hAWK programs” folder under a name that begins with a '$', such as “$PatternTester”. Substitute your test pattern for the word “pattern” below when you have one to try out. Grab some example input from somewhere, paste it into a new window, call hAWK, select “$PatternTester”, and run it with the “All of front text” input option, leaving “Show stdout” with a check mark. All input lines that match your pattern will produce a comment in stdout, which will be shown to you after the run. #A small program for testing patterns. #Replace the word "pattern" on the next line with your pattern. pattern { print "Pattern matched input line", NR, "which was:" print "\t", $0 ++n } END { if (n > 0) print "Total matches:", n; else print "No matches were found."; }#the end ------- Actions ------- Introduction Virtually everything you have learned about patterns can be carried over to actions for constructing conditional tests (excepting BEGIN, END, range patterns, and default behaviour when parts of a pattern are left out). 
For example,

    $1 ~ /NUM/ {if ($2 ~ /RANGE/)
                    --then the first field contained "NUM", and
                      the second field contained "RANGE"--
               }

or

    FNR < 10 {if (FNR == 1)
                  print "First line of current file is:", $0
              else if (FNR == 2)
                  print "Second line of current file is:", $0
              etc
             }

which demonstrate that it is possible to place a general test in the pattern, and then proceed with more specific tests in the action statements.

You’ve probably noticed that hAWK expressions strongly resemble C code, and this is no accident: leaving aside the advanced machinery of C dealing with pointers, structs and unions, and multi–dimensional arrays, what you know about writing C carries over to hAWK. There are some omissions, such as no need to declare variables, no prototypes for functions, and no brackets around the arguments of some built–in functions (print, getline), that require a bit of adjustment. And there are some additions (most notably regular expressions, built–in string functions such as “match”, and the way input is automatically retrieved to $0) which require a bit of work to grasp comfortably. But regular expressions were the only tough part; the rest is easy by comparison, and you should count your hAWK diploma as a foregone conclusion if you keep going here.

You have met variables, including built–in and field variables, and the operators which are especially useful for building patterns: the sections below will round out the list of operators, describe hAWK’s built–in functions dealing with numbers and strings, and introduce control–flow statements (if, for, while, etc) which allow you to choose between alternatives or repeatedly execute statements.

Knowledge of C will speed up learning hAWK. However, hAWK is simpler than C, so if you are new to C as well you should find that learning hAWK will speed up learning C.
Whatever your background, you should regard hAWK itself as an essential part of this manual; if you have a small problem, or an idea that wants polishing, whip up a little hAWK program and give it a try. A preview of “print” Ultimately, your hAWK program will produce output. The “print” statement will answer most all of your output needs, being simpler in form than the “printf” function which has more sophisticated formatting. Pass “print” a list of variables or constants separated by commas, and they will be printed to stdout, with the commas replaced by the output field separator (the built–in variable OFS, by default a blank). The contents of ORS (the output record separator, by default a newline) will be appended to the end of what was printed. For example: this one–line program {print FNR, $0} will duplicate all input to stdout, adding a line number to the beginning of each line. The number will be reset to 1 at the beginning of each input file, but all input files will be concatenated together in stdout. {print $1} will print just the first field of each input line to stdout. $1 ~ /extern/ {print FILENAME, FNR} will print the (full path) file name and line number where the word “extern” was seen. Variables and strings may be concatenated together by using a space instead of a comma between them, for example a = "Sesqui" b = "alien" print a "ped" b which produces "Sesquipedalien" (note there is no built–in spelling checker). Concatenation is slower than using commas to separate the items for “print”, best used only if you must avoid having the OFS space between two items. Note that print a, "ped", b produces "Sesqui ped alien". More on “print” later, but for the time being if you find yourself wondering what an operator or function produces—assign the result to a variable and print it out. Expression operators With the exception of string concatenation and the matching operators, the operators in hAWK are the same as C operators. 
They apply to both numbers and strings wherever it is logical, and all numbers are floating point numbers. Note that if a variable is assigned an integer value then it can be treated as an integer: for example, if i = 1 at some point, then later the test

    if (i == 1)

will evaluate to true (non-zero), with no failure due to obscure floating point rounding trouble. The operators in hAWK, in order of increasing precedence, are:
--------------------------------------------
= += -= *= /= %= ^=
    Assignment. Both absolute assignment ( var = value ) and operator-assignment (the other forms) are supported. “a += b” is equivalent to “a = a + b”.
?:
    The C conditional expression. This has the form expr1 ? expr2 : expr3. If expr1 is true, the value of the expression is expr2, otherwise it is expr3. Only one of expr2 and expr3 is evaluated.
||
    logical OR. In “a || b” if a is true then b is not evaluated.
&&
    logical AND. In “a && b” if a is false then b is not evaluated.
~ !~
    regular expression match, negated match. See “String-matching patterns”.
< <= > >= != ==
    the regular relational operators. Note especially that strings can be compared, eg if ($3 == "cat"). In “a <= b” or the like, if both arguments are numbers the comparison is done numerically, otherwise they are compared as ASCII strings.
blank
    string concatenation; if a = "John" and b = "Henry" then c = a b; produces c = "JohnHenry".
+ -
    addition and subtraction.
* / %
    multiplication, division, and modulus ( x%y produces the remainder of x divided by y, equivalent to x - int(x/y)*y ).
+ - !
    unary plus, unary minus, and logical negation.
^
    exponentiation.
++ --
    increment and decrement, both prefix and postfix.
$
    field reference. $0 is the entire current record, $1 the first field, and $NF the last field. Fields may be changed or added.

Some examples:

    {lines[++n] = $0}

accumulates all input lines to the array lines[]. The variable “n” starts out as 0, so the “++n” produces 1 as the first index.
At the end of input “n” is equal to the number of input lines seen, so

    END {print lines[1]; print lines[n]}

would print out the first and last lines of input.

Built–in numeric functions

hAWK has the following pre-defined arithmetic functions, with x and y as arbitrary expressions:

atan2(y, x)
    returns the arctangent of y/x in radians.
cos(x)
    returns the cosine of x, where x is in radians.
exp(x)
    the exponential function "e to the x".
int(x)
    truncates to integer (eg int(7.325) gives 7); to round a non-negative x, use int(x + .5).
log(x)
    the natural logarithm function, base e. For log base 10, use log(x)/log(10).
rand()
    returns a random number, 0 <= rand() < 1.
sin(x)
    returns the sine of x, where x is in radians.
sqrt(x)
    the square root function.
srand(x)
    use x as a new seed for the random number generator. If no x is provided, the time of day will be used. The return value is the previous seed for the random number generator.

Some examples:

atan2(0,-1) gives π, and exp(1) gives e.

    theta = atan2(y,x)
    r = sqrt(x*x + y*y)

converts rectangular x,y to polar r,theta.

    int(max * rand())

produces a random integer from 0 to max-1, inclusive.

Built–in string and file functions

There is only one string operator, the concatenation operator, invoked when two variables or constants are separated by a space. Other useful string manipulations in hAWK are carried out by built–in functions. In the following table, r is a regular expression, s and t are strings, a and b are arrays, and i and n are integers.

gsub(r, s, t)
    for each substring matching the regular expression r in the string t, substitutes the string s, and returns the number of substitutions. If t is not supplied, uses $0.
index(s, t)
    returns the index of the string t in the string s, or 0 if t is not present.
length(s)
    returns the length of the string s.
match(s, r)
    returns the position in s where the regular expression r occurs, or 0 if r is not present, and sets the values of RSTART and RLENGTH.
split(s, a, r) splits the string s into the array a on the regular expression r , and returns the number of fields. If r is omitted, FS is used instead. sprintf( fmt , expr-list ) prints expr-list according to fmt , and returns the resulting string. See the discussion of “printf” for details. sub(r, s,t) this is just like gsub , but only the leftmost matching substring is replaced. Returns number of substitutions. substr(s, i, n) returns the n-character substring of s starting at i . If n is omitted, the rest of s is used. tolower( s ) returns a copy of the string s , with all the uppercase characters in s translated to their corresponding lowercase counterparts. Non-alphabetic characters are left unchanged. toupper( s ) returns a copy of the string s , with all the lowercase characters in s translated to their corresponding uppercase counterparts. Non-alphabetic characters are left unchanged. lookup( s ) returns integer–coded C type of s (s should be a word). (At present this function is supported by: EnterAct. Types are taken from whatever project is open at the time.) See “$LookupTest” or “$XRef” for an example. Type integer returned ---- ------------ defined constant or macro 1 file–scope variable 2 function 4 enum constant 8 typedef 16 struct tag 32 union tag 64 enum tag 128 other 0 sort(a,b,s) produces an index in the array “b” that can be used to access the elements of “a” in sorted order. The string “s” specifies the kind of sort; "a" for ASCII, "n" for numeric, "d" for dictionary order, and "ra", "rn", "rd" for reverse of the same. Returns the number of elements in the array “b”, which is indexed numerically from 1 upwards. The elements of “b” are the indexes of “a” in sorted order provided “b” is accessed in the sequence b[1], b[2], b[3] etc. Typical use is maxIndex = sort(a, b, "d") for (i = 1; i <= maxIndex; ++i) print a[b[i]] which will print the elements of a in sorted dictionary order. 
See “$WordFrequency” and “$XRef_Full” for examples, and “$SortTest_Nums” for a simple numeric example.

time( )
    returns the current time, eg "Sunday, October 27, 1991 09:03:30 AM". Note this is the time when the function is called, down to the second, whereas the TIME variable holds the time at which your program run starts, down to the minute. See “$TIME” for an example.

prompt( s )
    displays an OK/Cancel dialog. The string “s” appears at the top of the dialog, and you can type in a string in an edit field. Returns what you type in, as though it was a string constant. Both the string “s” and what you type in are limited to 255 characters. For an example of usage see “$PromptTest” and “$YoungMath”. Typical use is

        x = prompt("Enter the number of lines to print:")
        if (x+0 > 0)
            {
            while (getline lne > 0 && ++i <= x)
                print lne
            }

    If you cancel the dialog or hit <Return> without typing in any text, prompt returns the null string "". NOTE this function is only useful if hAWK is called up in the “immediate” mode (typically hold down the <Shift> key when selecting “hAWK”). In “concurrent” mode, “prompt()” does nothing but return the empty string "" without displaying a dialog.

progress(s)
    displays the string “s” in a dialog on your screen (the message stays on the screen). You can change the message with another “progress” call. “progress” returns the number of times it has been called, and the dialog goes away by itself at the end of your program run. For a test sample, see “$ProgressTest”. NOTE this function is only useful if hAWK is called up in the “immediate” mode (typically hold down the <Shift> key when selecting “hAWK”). In “concurrent” mode, “progress()” does nothing but return 0.
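The index array produced by “sort(a, b, s)” can be a little hard to picture. Since “sort” is a hAWK addition and not something other AWKs can be assumed to have, here is a rough sketch of the same idea in plain AWK for the numeric case. The name indexSortNumeric and the simple insertion sort are choices made for this illustration only; nothing here is meant to describe how hAWK does the job internally.

```awk
# Fills b[1], b[2], ... with the indexes of a[], ordered by the numeric
# value of the elements of a[], and returns the number of elements.
function indexSortNumeric(a, b,    k, n, i, j, tmp) {
    n = 0
    for (k in a)
        b[++n] = k                  # collect the indexes in arbitrary order
    for (i = 2; i <= n; ++i) {      # insertion sort applied to the index array
        tmp = b[i]
        for (j = i - 1; j >= 1 && a[b[j]] + 0 > a[tmp] + 0; --j)
            b[j + 1] = b[j]
        b[j + 1] = tmp
    }
    return n
}

BEGIN {
    price["pear"] = 30; price["fig"] = 7; price["plum"] = 12
    maxIndex = indexSortNumeric(price, order)
    for (i = 1; i <= maxIndex; ++i)
        print order[i], price[order[i]]
}
```

The array “price” itself is never rearranged; only “order” changes, which is why the elements are reached as price[order[i]]. The run above lists fig, plum and pear in that order.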
--and added for hAWK version 2 (mainly file functions): Note in the functions below where a file or directory name is required it must be a full pathname, of the form “disk:folder1:folder2:...:folderN:filename” for a file, or “disk:folder1:...:folderN” or “disk:folder1:..:folderN:” for a directory (the second version has a colon at the end). For a disk name, use “disk:” rather than “disk”. beep( n ) does a SysBeep(n); if the duration "n" is <= 0, the menu bar will flash instead. Durations of 0,1,2,5 work best. copy( s, t ) copies the file named “s” to the file named “t”. Both file names must be full pathnames (disk:folder:...folder:filename). Either the location or name or both can be changed. If file “t” already exists, it must be closed and unlocked. Both creator and type are preserved, and the resource fork is copied as well as the data fork. Any kind of file can be copied. To move or rename a file, use if (copy(s,t)) remove(s) (this is an efficient way to move a file, but there is a separate rename() function). NOTE that t's folders will be created if needed. Returns 1 if successful, 0 if the copy could not be done. exists( s ) returns 1 if the file named “s” exists, 0 if it does not. Any kind of file can be tested. fdate( s ) returns date/time of last modification of file named “s”, format “yr:mo:day:hr:min:sec” where yr is 4 digits, and the rest are 2 (eg always 01 rather than just 1). The length of the string is always 19 (or 0 if no date could be extracted) and the colons and digits always occupy the same positions. fsize( s ) returns size in bytes of the data fork only of the file named “s” getclip( n ) returns the calling application’s current clipboard text, up to a maximum of the first “n” bytes. Use n = 0 or omit it entirely if you want the entire clipboard. For example, if the current clip is “Some text here” then getclip(6) returns “Some t” whereas getclip(0) or getclip() returns the entire clip. At present this function is supported by: EnterAct. 
putclip( s ) replaces the calling application’s (private) clipboard with the string “s”. Note that other applications won’t see the change until you switch out of the calling app. The length of s is limited to 32,767 characters (as are all hAWK strings). See the “$Clip...” functions in the “hAWK programs” folder for examples using getclip/putclip. Supported by: EnterAct. list( s, a ) given file or directory full pathname in “s”, produces list of full pathnames for all TEXT files in the directory (either the directory named or the directory holding the file), as elements indexed 1,2,3... in the array “a”. Note subdirectories are also excluded. Returns the number of files in the list. nested( s, a ) given a file full pathname in “s”, generates list of full pathnames for directories at the same level ("sibling folders"); given directory name, generates list of subdirectories at the top level in the named directory (“offspring folders”). The list is returned as elements indexed 1,2,3... in the array “a”. In other words, the same as “list” but for folders rather than TEXT files. Note neither “list” nor “nested” look beneath the top level of the folder in question. Returns the number of directories in the list. remove( s ) deletes the file named “s”, provided it is closed and unlocked. Use with caution, this is not undoable unless you get lucky using your favourite file recovery tool. Returns 1 if the file was deleted, 0 otherwise. Use with caution! rename( s, t ) takes the file with full pathname “s”, and renames it “t”. The new name “t” can be a full pathname, or just the new file name proper, as in rename("Disk:dir1:aardvark", "Disk:dir1:fruitbat") or equivalently rename("Disk:dir1:aardvark", "fruitbat") This function works only with files, not directories or volumes, returning 1 if the rename was carried out, 0 if not. The version 1 functions form the heart of hAWK, and you will find examples of usage of one or more of these in nearly all the sample programs. 
The version 2 functions have more limited scope, but keep them in mind when you need to wrestle with files. Within the replacement string 's' of gsub(r,s,t) and sub(r,s,t), a '&' is taken to stand for the entire string of text that was matched by the regular expression 'r'. For example, gsub(/cat/, "&s", t) with t = "cat and dogs" produces t = "cats and dogs" after the substitution. Use “\&” if you want a literal '&' in the replacement string. Using sub, gsub, and match effectively is entirely a matter of becoming comfortable with regular expressions (practice makes perfect). The regular expressions in these functions can be static, as in if (match($0, /struct/))... or dynamic (the contents of a variable) as in wordStart = "^|[^a-zA-Z'-]"#beginning of string or non–word character optLetters = "[a-zA-Z'-]*"#zero or more word characters findString = wordStart "(A|a)ct" optLetters if (match($0, findString))... (which matches eg “act”, “Actor” but not “tract”, or “Reactor”). It’s sometimes handy to use the “Set variables” dialog to set the string to be found (see $MFSLister, for example), or you can even read the string to be found out of the input itself, as in FNR == 1{find = $1; rep = $2} FNR > 1{gsub(find, rep)} which sets the strings for find and replace from the first two fields on the first line of input, and then uses them to do replacement on all subsequent lines. A miscellany: {gsub(/->resourceid/, "->resourceID") gsub(/\.resourceid/, ".resourceID") } copies all input to stdout, changing “resourceid” to “resourceID” when it appears as a member name (note $0 is used in the gsub by default). gsub("\n", "\n", multi) returns a count of the number of returns (newlines) in the string “multi”. gsub(/boo/, "&&s") turns “boo” into “booboos” everywhere in $0. index("abcdef", "cd") returns 3. match("abcdef", /cd/) returns 3, and sets RSTART to 3, RLENGTH to 2. 
    z = split("hour:minute:second", arr, ":")

assigns 3 to z, with arr[1] = "hour", arr[2] = "minute", arr[3] = "second".

Given str = "Now is the time", substr(str,1,3) returns "Now", substr(str,8) returns "the time".

More examples follow the next section.

Control-flow statements

Statements in hAWK may be grouped with curly braces, statements can be executed only when a certain condition is met, and statements can be repeatedly executed according to the value of some condition. While hAWK does not have a “goto”, it does allow you to jump back to the top of your pattern–action statements with “next”, or jump to your END statements on the way out the door with “exit”. In the following list of control statements, any instance of “statement” can be replaced by a group of statements enclosed in curly braces {}:

{ statements }
    Simple grouping of several statements together, so that conditional or repeated execution can be applied to the group.

if (condition) statement1 [ else statement2 ]
    If the condition evaluates to true then statement1 is carried out; the “else” clause is optional, and statement2 will be executed if the condition is false.

while (condition) statement
    The condition is first evaluated, and if it is false then the statement is skipped. If it is true then the statement is executed; the condition is again evaluated, and the statement again executed if the condition is true, and this process continues until the condition is false. Note that if the condition is false the first time then the statement will not be executed at all. “while” loops are affected by break and continue statements.

do statement while (condition)
    The statement is always executed at least once; then the condition is evaluated, and if it is true then the statement is executed again. This process continues until the condition is false. Unlike the “while” loop, the “do” loop always executes its statement at least once.
for (expr1; expr2; expr3) statement
    eg “for (i = 1; i <= 6; ++i) {print i}”. Mnemonically, “for it’s (a jolly good fellow)” helps: in “it’s”, the “i” stands for initialization, the “t” for “test”, and the “s” for “step”. expr1 is the initialization, executed only once, just before the “for” loop proper is entered. Next expr2, the test, is evaluated, and if it is true then the statement is executed, otherwise the for loop ends and control passes to the next statement beyond it. If the statement is executed then expr3, the step, is carried out, and then it’s back to the top of the loop: no more initialization, but the sequence test, execute, step, continues until the test produces false.

for (var in array) statement
    Indexes for the array are retrieved one–by–one to the variable “var”, though not in a readily predictable order, and the statement is executed for each index.

break
    For use only among the statements that make up the body of a while, do, or for loop. Usually found in the form “if (condition) break;”. When the break is executed, control immediately passes to the next statement after the loop.

continue
    Also for use only in a while, do, or for loop, and also usually executed only when the condition of some if–statement is true. When encountered, control passes to the very end of the statements making up the body of the loop, and the next iteration of the loop begins.

next
    Stop processing the current input record. The next input record is read and processing starts over with the first pattern in the hAWK program. If the end of the input data is reached, the END block(s), if any, are executed.

exit [ expression ]
    In an END action, exit truly causes the hAWK program to terminate. Anywhere else, the exit statement causes the program to jump to the END actions, and only if none are present does the program immediately terminate. The “expression” is provided for compatibility with standard AWK programs, and won’t be of any use to you.
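To see several of the statements listed above working together, here is a small sketch in plain AWK. The functions sumOdd and halvings are invented for the illustration and do nothing especially useful beyond exercising “for”, “do”, “break” and “continue”.

```awk
# Sums the odd numbers in 1..limit, stopping early once the sum exceeds cap.
function sumOdd(limit, cap,    i, total) {
    total = 0
    for (i = 1; i <= limit; ++i) {
        if (i % 2 == 0)
            continue        # even number: skip the rest of the body
        total += i
        if (total > cap)
            break           # control passes to the statement after the loop
    }
    return total
}

# Halves n until it is below 1 and returns the number of halvings done.
# The body of a do loop runs once even when n starts out below 1.
function halvings(n,    count) {
    count = 0
    do {
        n /= 2
        ++count
    } while (n >= 1)
    return count
}

BEGIN {
    print sumOdd(10, 100)   # 1+3+5+7+9, the cap is never reached
    print sumOdd(10, 8)     # stops as soon as 1+3+5 exceeds the cap
    print halvings(8)
    print halvings(0.5)
}
```

The four lines printed should be 25, 9, 4 and 1. The last result shows the “do” loop executing its body once although the test was false from the start.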
Here’s a small sample program, with lots of potential if you’re looking for a first hAWK project: BEGIN { find = "(^|[^@])([A-Z][A-Z]+)" #note \1 \2 grouping by ()() rep["CA"] = "California" rep["HYPO"] = "hypobetalipoproteinemia" rep["RE"] = "regular expression" #...etc... note just a part of a word is OK } {loopCount = 0; while (match($0, find) && loopCount++ < 50) { acronym= substr($0, RSTART, RLENGTH) gsub(/[^A-Z@#]/, "", acronym) #or sub(find, "\2", acronym) if (acronym in rep) sub(find, "\1" rep[acronym])#replace acronym by expansion else sub(find, "\1@#@\2")#stick '@#@' in front of unknown acronym } if (loopCount >= 50) { print "The acronym", acronym, "is looping forever." ; exit } gsub(/@#@/, "")#trim the protector by replacing it with null string print #print the altered line to stdout } - builds a glossary at the beginning, and then expands any acronyms in the input for which there is an entry in the array “rep”, sending the expanded version to stdout. The “sub” and “match” both match the leftmost longest string of uppercase letters, and replacement is done one match at a time until the line contains no more matches. To avoid an endless loop, finds for which there is no expansion have a '@#@' stuck in front of them. This '@#@' is trimmed away after. A silly example: #print arr[] elements with index, according to value of “sequence” string: #use as much variety as possible, to avoid boredom. If sequence is numeric, #“arrMax” holds the maximum index. if (sequence == "up")#Numeric increasing index { i = 1; do { print i, arr[i++] } while (i <= arrMax); } else if (sequence == "down")#Numeric decreasing index { i = arrMax; while (i >= 1) { print i, arr[i] --i } } else if (sequence == "associative")#Arbitrary indexes { for (i in arr) { print i, arr[i] } } else { print sequence, "???!!!!" print "Repeat after me, ten times:" for (i = 1; i <= 10; ++i) print "I will proofread my programs." 
exit } Virtually all of the sample programs in the “hAWK programs” folder illustrate control–flow statements. Empty statements The empty statement, which does nothing at all, is denoted by a semicolon. Loops require a body of some sort, and if you wish no statements to be executed in the body of the loop then just use a single semicolon for the body. More rarely, an empty statement is useful as the statement for an “if” statement. ------------------ User-defined functions ------------------ Functions in hAWK take the form: "function" name(parameter1, parameter2,... local1, local2...) { statements } They are executed when called from within an action statement (or as part of a pattern). hAWK function definitions begin with the keyword “function”, and no return type is declared, though a value may optionally be returned. Local variables are listed after the parameters for the function, more to simplify the grammar of the language than anything else. Scalar parameters are passed by value (ie a local copy is made for the function, and the original variable in the function call is not touched by the function) whereas array parameters are passed by reference (the parameter array name refers to the same array that is provided as the argument). Function definitions must be placed at the top level of your program outside any pattern–action blocks, and you generally end up with a readable program if you put all of your function definitions at the end of your program. Here’s a typical function: function Swap(a, i, j temp) { temp = a[i] a[i] = a[j] a[j] = temp } When called, it appears for example as arr[1] = 7; arr[4] = 9; Swap(arr, 1, 4) which results in arr[1] = 9, arr[4] = 7. Note that the “temp” variable is intended for use only within the Swap function, and is a local variable rather than a parameter of the function. Local variables are initialized to 0 and "" each time the function is called. 
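The three rules just described (scalars are copied, arrays are shared, locals start out fresh) can be checked with a few lines of plain AWK. The function bump and its parameter names are made up for this sketch.

```awk
# Scalars are copied into the function; arrays are shared with the caller.
function bump(count, list,    seen) {
    seen = seen + 1         # local: starts out empty on every call
    count = count + 1       # changes only the copy held by the function
    list[1] = list[1] + 1   # changes the array belonging to the caller
    return seen
}

BEGIN {
    n = 5; arr[1] = 5
    first  = bump(n, arr)
    second = bump(n, arr)
    print n, arr[1], first, second
}
```

This should print “5 7 1 1”: n is untouched, arr[1] has been raised twice, and the local “seen” was reset before each call.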
No space should be put between the function name and the '(' of the argument list when calling one of your own functions, to avoid invoking the simple–minded concatenation operator.

Functions may return an expression, as in

    function SumArraySquared(a, sum)
        {
        for (i in a) #unlike C, array size need not be known separately
            sum += a[i]#note sum is local, automatically inited to zero
        return sum*sum
        }

or

    function StringUpTo(str, upto)
        {
        return substr(str, 1, index(str, upto) - 1)
        }

(eg StringUpTo("This is: a test", ":") would return "This is").

Some details about functions:
Newlines are optional after the left curly brace of the function body and before the closing right brace.
Functions may call each other and may be recursive.
The word func may be used in place of function. For tired typers only.

-------
Output
-------

The “print” statement

“print” sends simply–formatted strings to a file, stdout by default. The expressions supplied to the print statement are separated from one another by commas, and may also be entirely surrounded by parentheses. The variations are

    print
    print expression1, expression2, ..., expressionN
    print (expression1, expression2, ..., expressionN)

A “print” with no expressions is an abbreviation for

    print $0

Each expression is converted to a string and printed in turn, with each comma being replaced by the built–in variable OFS, by default a single blank. Each print statement is terminated with the built–in ORS, by default a newline. The parenthesized version of “print” is necessary if relational operators are present in the expressions, since the '>' operator can mean “greater than” or “redirect output to the file...”; see “Output into files” below.

The print statement is used in virtually every sample program provided, and the more–sophisticated “printf” is seldom seen since fancy formatting is not often needed.
Some common print statements are
    print ""                #prints just a blank line
    print names[z], FNR     #documents location of something by printing file name and line
(search this file from the top for “names[z]” if you missed it).

The “printf” statement

This function also has a parenthesized and unparenthesized form,
    printf format, expression1, expression2, ..., expressionN
    printf(format, expression1, expression2, ..., expressionN)
and, as with “print”, the parentheses are needed only if a relational operator is contained in one of the expressions.

The “format” argument is interpreted as a string, and may contain either literal text to be printed or format specifications for strings or numbers to be printed. Format specs are indicated in the format string by a '%', and there should be one expression following the format for each format specification; eg if you specify that a string, a number, and a string be printed, then you list the string, number, and string after the format, in the same order, separated by commas.

The hAWK versions of the printf and sprintf functions accept the following conversion specification formats, entirely borrowed from C:
    %c    an ASCII character. If the argument used for %c is numeric, it is treated as a character and printed. Otherwise, the argument is assumed to be a string, and only the first character of that string is printed.
    %d    a decimal number (the integer part).
    %i    just like %d.
    %e    a floating point number of the form [-]d.ddddddE[+-]dd.
    %f    a floating point number of the form [-]ddd.dddddd.
    %g    use e or f conversion, whichever is shorter, with nonsignificant zeros suppressed.
    %o    an unsigned octal number (again, an integer).
    %s    a character string.
    %x    an unsigned hexadecimal number (an integer).
    %X    like %x, but using ABCDEF instead of abcdef.
    %%    a single % character; no argument is converted.
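A few of the conversions listed above can be tried out with sprintf, which accepts the same formats as printf but returns the string instead of printing it. This is only a sketch: the variable names are invented, and the results shown in the comments are those of a standard awk on an ASCII system, which hAWK is assumed to match.

```awk
BEGIN {
    asChar = sprintf("%c", 65)          # numeric argument, treated as a character: "A"
    first  = sprintf("%c", "hello")     # string argument, first character only: "h"
    whole  = sprintf("%d", 3.9)         # integer part only: "3"
    octal  = sprintf("%o", 8)           # "10"
    hexLow = sprintf("%x", 255)         # "ff"
    hexUp  = sprintf("%X", 255)         # "FF"
    fixed  = sprintf("%f", 1.5)         # "1.500000"
    pct    = sprintf("%d%%", 50)        # %% yields a literal percent sign: "50%"
    print asChar, first, whole, octal, hexLow, hexUp, fixed, pct
    }
```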
There are optional, additional parameters that may lie between the % and the control letter (also from C):
    -        the expression should be left justified within its field (note if the '-' is absent then the expression is right justified)
    width    the field should be padded to this width. If the number has a leading zero, then the field will be padded with zeros. Otherwise it is padded with blanks.
    .prec    a number indicating the maximum width of strings or digits to the right of the decimal point.
For example, %-23.14s prints strings in a field 23 characters wide, left justified, printing at most 14 characters from the string. And %8.4f will print a floating point number in a field 8 characters wide, right justified, with 4 digits to the right of the decimal point.

The dynamic width and prec capabilities of the C library printf routines are not supported. However, they may be simulated by using the hAWK concatenation operation to build up a format specification dynamically.

Some examples:

“print var” always appends the value of ORS (by default a newline); to avoid this, use
    printf("%s ", var)
and when a newline is needed, supply one yourself with something like print "" or printf("%s\n", var).

Given strings of variable width in fields $1 and $2, reformat to print these strings right–justified in two nicely–lined–up columns:
    {
    one[++n] = $1
    two[n] = $2
    if (w1 < length($1))
        w1 = length($1)
    if (w2 < length($2))
        w2 = length($2)
    }
    END {w1 += 2; w2 += 2;  #a couple of spaces between columns
        for (i = 1; i <= n; ++i)
            printf "%" w1 "s" "%" w2 "s\n", one[i], two[i]
        }
This illustrates using the hAWK concatenation operation “to build up a format specification dynamically”; for example, if w1 = 9 and w2 = 15 (after adding 2) then we get
    printf "%9s%15s\n", one[i], two[i]
as the effective printf statement.

Output into files

By default, “print” and “printf” send all of their output to stdout. However, the redirection operators '>' and '>>' allow you to send output to any text file.
Redirecting output takes one of the forms print expression–list > outfile print(expression–list) > outfile printf format, expression–list > outfile printf(format, expression–list) > outfile print > outfile or any of those with '>>' instead of '>'. The '>' operator will erase the contents of outfile before beginning to write to it, whereas '>>' will append what is being printed to outfile without clearing the file first. Both operators open the file “outfile” the first time it is encountered in the program, and keep it open. The file will be closed for you at the end of your program, but if you have many files to write to you should close each output file yourself when you are done with it, with “close(outfile)”. hAWK deals with full path names only, and the names of all output files must be full path names if you want the file to end up in a predictable place. Since hAWK is adept at manipulating strings, and a file name is just a string, you can manufacture file names and paths within your program to fit most needs. The built–in variable STDPATH contains the path leading to your stdout file, so concatenating a file name to the end of STDPATH, as in outfile = STDPATH "Search Results" will allow you to write files to the folder containing your stdout file, which is your THINK C/Drag_on Modules folder if you followed installation suggestions. The simplest way to concoct the appropriate path name for an arbitrary location on your hard disk(s) is to run the hAWK program “$EchoFullPathNames”, choosing a text file in the desired location as the input for the program. This will give you the explicit full path name, eg Disk:C Projects:Banana INIT:Banana source:In_your_ear.c from which you can copy the path to use as prefix for output file names, in this case Disk:C Projects:Banana INIT:Banana source: (neglect not that last colon!) 
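Copying the path prefix by hand can also be done inside a program. The sketch below is one possible way, using a made–up helper named “PathPrefix” that keeps everything up to and including the last colon of a full path name. The helper and the variable names are not part of hAWK.

```awk
# PathPrefix() is a hypothetical helper: it returns everything up to and
# including the last colon of a full path name, or "" if there is no colon.
function PathPrefix(fullName,    parts, n, i, prefix)
    {
    n = split(fullName, parts, ":")     # the last piece is the file name proper
    prefix = ""
    for (i = 1; i < n; ++i)
        prefix = prefix parts[i] ":"    # rebuild the folders, each followed by a colon
    return prefix
    }

BEGIN {
    full = "Disk:C Projects:Banana INIT:Banana source:In_your_ear.c"
    outfile = PathPrefix(full) "Search Results"
    print outfile   # Disk:C Projects:Banana INIT:Banana source:Search Results
    }
```

The value in “outfile” can then be used as the target of '>' or '>>' in a print statement.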
As special cases you can use the names "stderr" and "stdout" to redirect output to your stderr and stdout files, eg print "Serious interstitial vacuities have been detected" > "stderr" which will quietly write the message to your stderr file—you won’t be notified that anything has been written there. Normally there isn’t much use for redirecting output to "stdout" since it goes there anyway by default. If your current input file happens to be in the right location for the output you intend to write (for example, if the output is to be an altered version of the input, saved under a different name) you can extract the path part of the input name, and tack it on to the beginning of your output file name to produce the needed full path name with this: BEGIN {outfile = "Results"}#a fixed name for this little example FNR == 1{#at the first line of the current input file z = split(FILENAME, names, ":");#fragment the full path into the array “names” for (i = z-1; i >= 1; --i) #note i = z gives the input file name proper outfile = names[i] ":" outfile;#put path in front of outfile name } Can you tell what this program does? FNR == 1{z = split(FILENAME, names, ":"); outfile = names[z]; if (match(outfile, /[0-9]+\.[cChH]$/) > 0) {#file name ends in number.c or the like versNumber = substr(outfile, RSTART, RLENGTH - 2);#just the number ++versNumber; versNumber = versNumber ".c"; sub(/[0-9]+\.c$/, versNumber, outfile); } else { print FILENAME, "does not end in number dot c or h, quitting early" exit } for (i = z-1; i >= 1; --i) outfile = names[i] ":" outfile } {print > outfile} —among other things, it fills up your disk pretty quick. (See $TabsToSpaces.) Closing files To close a file named by expr, use close(expr) This could be a fairly explicit name, such as close (STDPATH "Results") where concatenation is used to create the full name, or it could be simple close(outfile) where outfile holds the string that is the full path name for the file being closed. 
If you write to a file, then you must close it before subsequently reading from it. More importantly, there is a limit on the number of files that can be open at once, so if your program writes to a large or arbitrary number of files it is good policy to close each file when it is completed. As you will see just below, it is also possible to take input from an arbitrary file by means of redirection with the “getline” function, and in this case as well it pays to close a file when you are done with it. ------ Input ------ FS, the input field separator If you leave FS set to its default value of a single space, then any combination of blanks and tabs will count as the field separator, and as a “bonus” any leading blanks or tabs will be removed from the first field of each record, though they will remain in the record itself (ie $1 is trimmed but $0 is not). FS is slightly odd in that it has two modes of interpretation; when it is a single character such as FS = ":" then the single literal character (no matter what it is) is taken as the input field separator, but if the string for FS is longer than a single character it is interpreted as a regular expression. Here are some commonly–used field separators: FS = "[ ]" —necessary if you wish the field separator set to a single space, since FS = " " invokes the default behaviour described above FS = "[ ,\t]+" —any mix of blanks, commas, and tabs FS = "\n" —a field is a complete line (see the discussion in the next section). RS, the input record separator In practise RS is either left to its default value of "\n" (ie a record is the same as a line) or can if needed be set to the null string "", in which case records are separated by one or more blank lines. The latter corresponds to a simple form of database, with all the lines of each record grouped together and blank lines between records. With these multi–line records it is often useful to also set the field separator FS to "\n", so that a field becomes a complete line. 
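The field separator modes described above can be explored without any input file by handing the same separator strings to split(). This sketch assumes that split() interprets its third argument by the same rules as FS, which is the usual awk behaviour but is not stated in this section. The array and variable names are invented.

```awk
BEGIN {
    line = "  alpha  beta\tgamma"
    nDefault = split(line, f, " ")              # default mode: runs of blanks and tabs, leading ones skipped
    first = f[1]                                # "alpha", with the leading blanks trimmed
    nColon = split("a:b::c", g, ":")            # single literal character; the empty field counts
    nSpace = split("a  b", h, "[ ]")            # exactly one space per separator; middle field is empty
    nMixed = split("x, y\tz", k, "[ ,\t]+")     # any mix of blanks, commas, and tabs
    print nDefault, first, nColon, nSpace, nMixed   # prints 3 alpha 4 3 3
    }
```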
Alas, these simple conceptions of a record are not often adequate. Narrative text and C source files require a more flexible approach to input which can be generally stated as “grab enough input to do the current job, and never mind where the lines end”. Several solutions are discussed in the “Beyond input records” section of “Advanced topics”. Don’t skip over the next section on “getline”, though, because it plays a strong supporting role.

The “getline” function

“getline” is a built–in function that allows you to retrieve input records from the current input file or from any other file. As you know, the default behaviour of a hAWK program is to retrieve input from your input files one record at a time, marching through the records and files from beginning to end. Often, however, one needs to read in a group of lines until some condition is met, or interrupt regular input to retrieve records from some other file, and these are the special capabilities that “getline” provides. It can be used in the following ways:
    getline                 sets $0 from next input record; sets NF, NR, FNR.
    getline < file          sets $0 from next record of file; sets NF.
    getline var             sets var from next input record; sets NR, FNR.
    getline var < file      sets var from next record of file.
and in all cases “getline” returns 1 if a record was successfully retrieved, 0 if the end of file was encountered, and -1 if some problem occurred, such as failure to find the file.

The effect of “getline” by itself is to dump the current string in $0 and replace it with the next input record, setting all the usual built–in variables. Program execution then continues with the statement following “getline”. By comparison, the “next” statement does everything that “getline” by itself does, but in addition processing starts over with the first pattern in your hAWK program.

If a variable name is present immediately after “getline”, then the input record is retrieved to the variable instead of to $0.
The '<' symbol is the input redirection operator meaning “get input from the file...”, and is followed by the name of the input file to use. Note that file names must be full path names, as is always the case in hAWK. Some examples: $MFS_SuperLister uses a buffer holding a variable number of lines, to match regular expressions that can span more than one line. The heart of this program is the action {multi = $0;#the first line is already there while (getline x > 0)#== 0 at end of file, < 0 for error { multi = multi "\n" x; ... } } which employs a “getline” to retrieve the contents of the current input file from the second line to the end of the file (the first line is already present in $0). This program is discussed further in the “Beyond input records” section of “Advanced topics”. $FilesInOrderTest illustrates the technique of reading in a list of input files, then setting up the built–in variables so that those files will be used as input for a program. In other words, the program receives a single input file which lists the actual input files to use; this file is read at the start of the program, and used to set up the built–in array ARGV[] so that the program will be “fooled” into taking input from the specified list of files. The list of files is read in at the beginning with BEGIN {while (getline _specific_file_ < ARGV[1] > 0) { if (length(_specific_file_) > 1 && index(_specific_file_, ":") > 0) ARGV[ARGC++] = _specific_file_; } close(ARGV[1]); ARGV[1] = ""; } which reads in the full path names for the input files (one name per line) from the first input file (ARGV[1]) into the variable “_specific_file_”. This program is discussed further in the “Other ways of specifying input files” section of “Advanced topics”. ---------------- The “hAWK” function ---------------- hAWK ( arr ) : executes the hAWK program specified by the array "arr", returns the “recursive depth” at which the call was executed. 
The array holds the command–line arguments to be passed to the new program, indexed 0,1,2.... The hAWK() function is a recursive call to hAWK itself, with all built–in variables reset to their initial values. “hAWK” can be called anywhere a function can be called (ie in an action or function, but not a pattern). It’s just like calling hAWK from the menu, but you don’t get a dialog so all arguments must be explicitly supplied. If the discussion below of what to put in "arr" seems a bit brief, see also “The command line and ARGV[]”. Each call to hAWK() does chew up some memory which is not freed until all hAWK programs terminate, so there is some finite limit on the number of times that hAWK() can be called. In addition, memory that your program allocates by creating arrays is not automatically freed, so if the program called by hAWK() is not the last thing that will be done then large arrays should be “emptied out” with something like for (w in array) delete array[w] —this memory will then be available for other programs. While hAWK() can be used to sequentially execute several small programs (see $Chain), more typically it is used to execute just one program—a program which is specially created by the calling program to do just the task required. The primary advantage offered by calling another program from within a program is that you can select, or even create, the program to be run after doing some preliminary analysis (reading a file or looking at the preset variables), and the program which is eventually run will be faster than a more general–purpose one. $MFS_SuperReplace for example creates a special search–and–replace program to do the s&r you specify with your “find” and “replace” variables, in which the regular expression to search for is an explicit string rather than the content of a variable (ditto the replace string). 
The advantage is that an explicit regular expression is analyzed only once at the start of a program, whereas a variable (dynamic) regular expression is re–analyzed every time it is used, even if its contents don’t change. The special–purpose program takes a moment to get going, but then runs noticeably faster than a general–purpose search–and–replace program which uses variables.

The general incantation to follow for creating the command–line array "arr" is:
    if (notFirstCall)   #needed only if making more than one hAWK() call
        {
        x = 0;          #arr[] is indexed 0 up - reset to 0 if making more than one call
        for (w in arr)
            delete arr[w];  #Avoid passing spurious arguments from last hAWK() call
        }
    arr[x++] = "hAWK";              #The command name in arr[0], anything you like, really.
    arr[x++] = "-f" programName;    #Full path name, eg
        #programName = STDPATH "Drag_on Modules:hAWK programs:" "Type&Run program"
    arr[x++] = "-f" FirstLibrary;   #Full path name. The "-f" indicates a program name
    ...
    arr[x++] = "-f" LastLibrary;
    arr[x++] = "-v" "firstVar=" someVarfirst    #Preset variables. "-v" indicates a variable
    arr[x++] = "-v" "secondVar=73";             #Value can be hard-set too
    ...
    arr[x++] = "-v" "lastVar=" lastVar
    arr[x++] = "--"                 #Signals only input files, if anything, follow
    arr[x++] = FirstInputFile       #Full path name
    ...
    arr[x++] = LastInputFile
    notFirstCall = 1;               #Needed only if making more than one call to hAWK()
    depth = hAWK(arr);              #invoke the program; returned value can be ignored.

If you wish to pass all input files along to the program being called, use
    for (j = 1; j < ARGC; ++j)
        arr[x++] = ARGV[j]
If you wish to use stdout as the input, use
    arr[x++] = STDPATH "$tempStdOut"
For some real examples, see $Chain, $Type&Run, $RunClip, and $MFS_SuperReplace.
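If the incantation is needed in more than one place, it can be wrapped in a function. The following is only a sketch: “BuildArgs”, its parameter names, and the sample path names are all invented here, and only the layout of the array (command name, "-f" program, "-v" presets, "--", input files) is taken from the description above. The hAWK() call itself is left as a comment, since it exists only inside hAWK.

```awk
# BuildArgs() and every name passed to it are hypothetical. It fills arr[]
# from index 0 upwards and returns the number of arguments stored.
function BuildArgs(arr, program, presets, nPresets, files, nFiles,    x, w, i)
    {
    for (w in arr)
        delete arr[w]               # avoid passing spurious arguments from an earlier call
    x = 0
    arr[x++] = "hAWK"               # command name, anything will do
    arr[x++] = "-f" program         # full path name of the program to run
    for (i = 1; i <= nPresets; ++i)
        arr[x++] = "-v" presets[i]  # each preset is a string of the form name=value
    arr[x++] = "--"                 # only input files, if anything, follow
    for (i = 1; i <= nFiles; ++i)
        arr[x++] = files[i]         # full path names of the input files
    return x
    }

BEGIN {
    vars[1] = "find=foo"; vars[2] = "replace=bar"
    inputs[1] = "Disk:Folder:First file"
    n = BuildArgs(args, "Disk:Folder:Some program", vars, 2, inputs, 1)
    for (i = 0; i < n; ++i)
        print i, args[i]
    # depth = hAWK(args)            # the actual call, available only inside hAWK
    }
```

Clearing the array on every call, rather than only on the second and later calls, costs little and removes the need for a “notFirstCall” flag.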
Note that no argument count “argc” needs to be passed to the hAWK() call; internally, the end of arguments is detected by looking for 10 consecutive null arguments (eg if arr[8] is non-null and arr[9] through [18] = "", then arr[8] is taken as the last real argument). A small bonus; when calling a hAWK program through the main dialog interface you are limited to presetting at most 10 variables, but when using the hAWK() function there is no limit on the number of variables you can preset. ------------- Advanced topics ------------- “Advanced” is a bit pompous, really—you should have read through the above material, tried out some of the supplied programs, and written a couple of small programs yourself by this point. That’s all “advanced” means. And the last section, “Calling hAWK through Minimal App”, is advanced only in terms of understanding what’s going on behind the scenes. The instructions themselves are easy to follow. Other ways of specifying input For use when you need to run a hAWK program on several input files with the files taken in some specific order, or if you need to hard–code the name of an input file into a program, and intend to process the contents of that file before or after all other input files. The way to persuade a hAWK program to treat input files in a specific order is to prepare the list of files in the order required, and then modify the program to use that list as the names of the input files. This requires building the list, and a small addition to the program itself, but it’s not hard to do: 1 If possible, use your calling application to select the files for multi–file operations (“searching”), and then run the hAWK program “$EchoFullPathNames”. hAWK uses full path names to specify files, and this program will produce a list of the full path names for the files you selected, in the window called “$tempStdOut”. You can painfully construct full path names for your files by hand, but using this hAWK program is the simpler way. 
2 Arrange the full path names into your desired order, and if it’s a list you anticipate using again, use “Save As” to save the list away permanently (the contents of $tempStdOut don’t survive from one run to the next).

3 Copy this block of code to the top of your hAWK program, before all other code:
    BEGIN {while (getline _specific_file_ < ARGV[1] > 0)
            {
            if (length(_specific_file_) > 1 && index(_specific_file_, ":") > 0)
                ARGV[ARGC++] = _specific_file_;
            }
        close(ARGV[1]);
        ARGV[1] = "";
        }#end
This is executed before the rest of your program, and transparently converts the list of input files in the array ARGV[] to the list provided in the one input file “ARGV[1]” that is actually supplied when running it. The name of that one original input file is nulled out, which persuades hAWK to ignore it when input processing starts for real.

4 When calling the hAWK program, select your list of files as the only input. If the list is in the front window, pick “All of front text”; if it’s in a file, use the “Select input file…” option to select the file. Then run the program.

If you want to try this out in a test program, read through “$FilesInOrderTest”, then run it and pass it a list of files. It will just print the list of files to $tempStdOut, confirming that they were read in the correct order.

If you want your program to take input from some specific file first, and then take input from whatever files are provided via the setup dialog, then you can pass your program the name of the specific file by means of a variable and process the file in a BEGIN block. Once again, the only real difficulty is to determine the full path name of the file, and this can be done by using $EchoFullPathNames as described above, but passing it the single file as input.
The method in full is: 1 Determine the full path name of the specific file, eg Hard Disk:Top Folder:Bottom folder:theFile 2 Do the processing of this specific file in the BEGIN block of your program, in the following way: BEGIN { while (getline _x < _specific_file_ > 0) { -process _x, which contains the lines of _specific_file_ } close(_specific_file_) - optional other statements in your BEGIN block } 3 While setting up your program for a run, use “Set variables” to provide the full path name of the specific file in the variable _specific_file_: _specific_file_=Hard Disk:Top Folder:Bottom folder:theFile and then click “Save settings” if you will be using this file name more than once. 4 Run your program, using the setup dialog to take input from wherever is appropriate. For an example, see “$WordFrequency”. If you want to process a special file after all regular input, then use the same structure as in point 2 above, but in an END block rather than a BEGIN block. If the specific file is to be treated in exactly the same way as your other input files, but must be processed first, then you can add this BEGIN block to the start of your program, again using a fixed full path name passed in the variable “_specific_file_”: BEGIN { for (i = ARGC; i >= 2; --i)#Note this creates ARGV[ARGC] ARGV[i] = ARGV[i-1]; ARGV[1] = _specific_file_; ARGC++; } Appending a specific input file is even easier, just BEGIN { ARGV[ARGC++] = _specific_file_; } You may find these techniques useful if your program needs a list of “data” before running, in other words too much information to fit in the ten variables that you can preset before each run. The built–in variable STDPATH is a path name which specifies the folder that holds, among other things, your “Drag_on Modules” folder, which in turn holds your “hAWK programs” folder. 
If your specific input file is in the “hAWK programs” folder for example, then you can avoid spelling out the full path name by using “Set variables” to set “_specific_file_” to just the name of the file, eg _specific_file_=Initial data file and then before using _specific_file_ insert the line _specific_file_ = STDPATH "Drag_on Modules:hAWK programs:" _specific_file_; to build up the full path name for _specific_file_. The above two methods can be blended together, for example to process an entire list of files before dealing with other input files provided by the setup dialog, and the files could be processed just as easily in an END block as in a BEGIN block. Beyond input records Let’s face it, not many text files are organized into neat lines or even groups of lines, so it is often more appropriate to use hAWK’s automated record retrieval as just the first stage of input, building functions on top of it to extract the precise input for the job at hand. Four techniques are discussed below: “control–break”, which keeps track of current input status by means of variables; “input on demand”, which buries the problem of getting the next piece of input in a single function; end–buffered input, which, if it reads in too much, temporarily stores the excess input to one side; and a rolling buffer, which acts as a multiple–line “window” on the input, the number of lines being variable at whim. The “control–break” style of reading input wrestles with the problem that you don’t know you’ve read in too much input until you’ve read in too much—what to do then? The general solution is to use variables to keep track of what the current “state” is (typically the states are “more input wanted” and “oops, a bit too much”). This leads to control constructs which seem to put the cart before the horse, in that one first takes action based on the value of a variable, and only later in the program is the variable set, which requires a bit of planning. 
As a simple illustration, $1 != lastFieldOne { print "New field one is", $1 lastFieldOne = $1 } which has been seen before, prints the contents of the first field on the input line whenever it changes. The variable “lastFieldOne” is used to control output. The general approach with control–breaks is, in pseudo–language: if (toofar) scramble to catch up; else proceed normally; set the toofar variable; At this point, you might want to read through an example of control–breaks: $XRef deals with the problem of skipping over comments and strings in C code, even though hAWK reads the input one line at a time and comments and strings can be anywhere. “Input on demand” is a way of using “getline” in combination with formatting functions to retrieve input sequentially as though the entire file were one large record, without cluttering up the top level of your program. The details of translating from line format to your required format are buried in a function that keeps track of the relation between the two; once this function is written, the top level of your program can call this function without worrying about the translation details. For a full example, see $Print_MENU_Resource, which deals with the problem of reading and formatting a MENU resource, as retrieved by Read Resource. End–buffered input relies on retrieving input lines through two functions, “GetNextLine” and “UngetLine”, and a variable “inBuffer” which keeps track of whether a line was “ungot”. With this approach there is no need to “scramble to catch up”, since the extra input is stored to one side until the next “GetNextLine” call. 
The conditions under which a line is to be stored due to going too far depend on the context (ie it’s up to you). Since scalar parameters are passed by value, the line is handed back in the global variable “_line” rather than through a parameter. The general approach is
    function DoTheJob(file)
        {
        getError = 1;
        while (GetNextLine(file) > 0)   #the line arrives in the global “_line”
            {
            if (you decide that’s too far)
                {
                UngetLine(_line);
                break;                  #leave the line for the next job
                }
            else
                process _line;
            }
        }
and the functions that get and unget are
    function GetNextLine(file)
        {
        if (getError <= 0)
            return getError;
        if (inBuffer)
            {
            _line = _buffer;
            inBuffer = 0;
            return 1
            }
        return getError = (getline _line < file)
        }
    function UngetLine(line)
        {
        _buffer = line
        inBuffer = 1
        }
where “file” is the full path name of the file to take input from. For an example using end–buffered input, see “The AWK programming language” by Aho, Kernighan, and Weinberger, page 105. You’ll find this approach useful if you have small databases to analyse.

The rolling–buffer approach to input adds lines of input to the end of a variable, and removes them from the front. The variable in question can contain more or fewer lines according to the needs of the moment, though there should be an upper limit on the number of lines. In pseudo–language, the general approach to rolling lines of input through a buffer variable is:
    while (getline x > 0)
        {
        multi = multi "\n" x;   #add current line x to end of buffer variable “multi”
        process multi however you like;
        while (too many lines in multi)
            {
            j = index(multi, "\n");     #position of first newline in multi
            #first line in multi, if needed, = substr(multi, 1, j);
            multi = substr(multi, j + 1);   #trim first line from multi
            }
        }
The “while (getline x > 0)” loop stops normally when the end of the current input file is reached (abnormal, as in file missing, is possible but unlikely). You can count the number of lines in multi at any time with
    numMultiLines = gsub("\n", "\n", multi)
which replaces newlines with newlines, and relies on gsub returning the number of replacements (awkward, but it works).
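The rolling–buffer pseudo–language above can be turned into a small runnable sketch. The helper names “AddLine”, “CountLines”, and “DropFirstLine” are invented for this illustration, and the words of a fixed string stand in for the lines that “getline” would normally supply. Unlike the pseudo–language, this version keeps no newline in front of the first line, so the line count is the number of newlines plus one.

```awk
# Hypothetical helpers for a rolling buffer held in a single string,
# with lines separated by "\n" and no leading or trailing newline.
function AddLine(buf, line)
    {
    return (buf == "") ? line : buf "\n" line
    }
function CountLines(buf,    copy)
    {
    if (buf == "")
        return 0
    copy = buf
    return gsub(/\n/, "\n", copy) + 1   # gsub returns the number of newlines replaced
    }
function DropFirstLine(buf,    j)
    {
    j = index(buf, "\n")                # position of first newline, 0 if only one line
    return (j == 0) ? "" : substr(buf, j + 1)
    }

BEGIN {
    maxLines = 3
    n = split("one two three four five", words, " ")    # stand-in for input lines
    for (i = 1; i <= n; ++i)
        {
        multi = AddLine(multi, words[i])
        while (CountLines(multi) > maxLines)
            multi = DropFirstLine(multi)
        }
    print multi     # the last three "lines": three, four, five
    }
```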
Arbitrary chunks of text can be removed from the front of multi if desired, rather than removing a line at a time. For a full and very useful example see “$MFS_SuperLister” which is capable of matching a regular expression or string of text even if it spans a variable number of lines. “$MFS_SuperReplace” is similar, doing multi–file search and replace instead of just listing matches. Calling hAWK through Minimal App Minimal App does not support passing text or file lists to hAWK, or showing results after a run, but these things can be done with a bit of extra work on your part. If you’re not interested in using Minimal App or some other application that provides minimal support for hAWK as your main hAWK–caller, you can skip this section. Since Minimal App does not support text documents at all, you’ll need an editor of some sort in order to do these things, and the assumption here will be that you’re running under MultiFinder (or system 7), using your favourite editor. You could also use a Desk Accessory editor together with Minimal App, a practical alternative if you intend to do nothing but run hAWK programs for an extended period. However, the focus here is on running hAWK programs while using an editor that does not support calling hAWK, by using Minimal App, MultiFinder, and a few workarounds. Ideally, an editor designed to run under MultiFinder should offer you protection against creating multiple versions of a file, and provide some automatic means of ensuring that you are always viewing the most up-to-date version of a file. An adequate solution in a single–user context would be for all editors to cooperate by offering the options of automatically saving all open files when switching out, and refreshing all open files from disk (if necessary) when switching back. At present almost all Macintosh editors are, in this sense, MultiFinder–unaware. 
So unless you know otherwise, it’s up to you to ensure that you keep the screen and disk versions of a file synchronised by Saving and Reverting with your editor at the appropriate times, as described below. Nuisance, what? First, let’s look at passing all or part of a file to hAWK, and viewing the result of a run. Since hAWK provides as your input option just the ability to select a single file when called through Minimal App, the simplest approach is to use a single common file as the input file for all programs which expect input from all or part of a file, and use the setup dialog to set (and save) that file as the input file. Oddly, the simplest file to pick is stdout ($tempStdOut, in the same folder that holds Minimal App). There is no conflict between passing stdout to a program as input, and then writing to stdout, because just before your program is run hAWK will rename your stdout file to “$tempOutAsInput” and then pass that name to your program. The “old” version of stdout will be used as input, and the “new” version will hold whatever was written to stdout during the run. With stdout as your common input file, the approach to use for passing all or part of a file from your editor to a hAWK program is: • Open the stdout file (ie $tempStdOut) in your editor, and leave it open (you can create this file by running $EnumSwitch with no input, or create it with your editor - it goes in the same folder as Minimal App and the Drag_on Modules folder, at the same level) • Copy/Paste the input text over all of stdout, and Save it. • Switch to Minimal App, call up hAWK, and select your program. • If it’s the very first run, use the “Select input file...” command to select $tempStdOut as the specific input file, and then Save Settings so the program will remember this. • Run the program. •Return to your editor, type a character in the stdout window, and Revert - you’ll see what was written to stdout by the program. 
To view any other created or altered files, you’ll need to open them with your editor.

Here’s an example run, to get you going. The example program is $EnumSwitch, which takes a list of enum constants and generates a “switch” statement based on them. You should be viewing this file with your editor, and also have Minimal App up and running in a separate partition under MultiFinder or system 7 at some point.
• Copy the indented line just below with your editor, and Save it as the entire contents of $tempStdOut, in the same folder where you’re keeping Minimal App.

    {first, second, third, fourth, twilightZone = -99}

• Leave the $tempStdOut file open.
• Switch to Minimal App and select hAWK; use the “Main program:” popup menu to select “$EnumSwitch” as the program to run.
• Use the “Select input file...” option under the “Take input from:” popup menu to select your “$tempStdOut” file as the input file to use with $EnumSwitch.
• Click the “Save settings” button so that $EnumSwitch will remember which input file to use for subsequent runs.
• Click the Run button, and wait until the highlighting goes away from the main menu bar, signalling that the program is done.
• Return to your editor, type a character in the $tempStdOut window, and pick Revert; you’ll see the results of $EnumSwitch on the line of enums you started with.

Some programs, such as $MFS_SuperReplace, naturally work with a list of files rather than just a single file. Here the simplest approach is to pass to your program a single input file which contains a list of the actual files to use as input. Again, it is best to settle on a single name for the file which contains the file list, and use the setup dialog to set the program to take input from this file. Here the name doesn’t matter, and something like “Standard File List” would do ($tempStdOut and other standard files are best avoided here).
It then remains to create the list of files, and to alter the program(s) internally so that they will properly interpret the file list.

First, the list of files: it should be a list of full path names, one file per line. You can generate the full path name for any single file by running “$EchoFullPathNames” with the file in question as input. Given that path, you can then generate full path names for other files in the same folder with a bit of copying and replacing of the file name, leaving the path the same. Some editors can generate full path names for files, which is an easier approach. If you have no easy way of generating full path names you might want to create a “master list” of full path names, and selectively copy the needed names to your “Standard File List” file before running a hAWK program.

Each program that you want to take input from your file list needs a small addition at the beginning. Open the program, and copy the following BEGIN block into the program, as the very first block of code in the file:

    BEGIN {
        # Read the file list (the first input file), one full path name per line.
        while (getline _specific_file_ < ARGV[1] > 0) {
            # Keep only entries that look like full path names.
            if (length(_specific_file_) > 1 && index(_specific_file_, ":") > 0)
                ARGV[ARGC++] = _specific_file_;
        }
        close(ARGV[1]);
        ARGV[1] = "";    # the list itself is no longer treated as input
    } #end addition

This persuades the program to take input from the files named in the list, rather than treating the list itself as the input. This may look familiar, as it’s the same alteration described in the first section of this chapter for persuading a program to take input from a list of files in specific order.

And finally, to run a hAWK program on a list of files:
• Your “Standard File List” file should contain the exact list of files that you want to use as input files, as full path names. Remember to Save it if you change it, before running your program.
• Switch to Minimal App, call up hAWK, and select the program to be run.
• If it’s the very first run, use the “Select input file...” command to select your file containing the file list as the specific input file, and then Save Settings so the program will remember this.
• Run the program.
• Back to your editor, and Revert stdout as described above if the program writes to stdout.

---------------------------
Calling hAWK from your application
---------------------------

What and how

Your application, that is, any application for which you have the source code, should be a THINK C project. If your application is written for some other C compiler, you should be able to modify the supplied source without too much anguish. If your application is not written in C you will still be able to call hAWK if your language supports calling C–style functions. However, you will have to provide your own equivalent for the file “Call_Resource.c”, not a trivial undertaking. The following discussion will assume that your application is built from a THINK C project.

Drag_on Modules, of which hAWK is an example, are CODE resources. To call a Drag_on Module, you load the first segment of its code (CODE 0), set up a pointer to an interface structure which contains file names and “callback” functions, and then jump to the starting address of the CODE resource as though it were a C–style function. Your application will load a list of Drag_on Modules into a menu for selection by the user.

Modifying your application to call hAWK and other Drag_on Modules divides into two stages: adding the source file “Call_Resource.c” to your project and inserting two function calls in your source; and then, when the basic version has checked out, deciding what level of support to supply for callback and result–showing functions.

Drag_on Modules can be called by virtually any application, but considerable enhancement is possible if your application supports text windows and files.
For example, hAWK can take input from the front text window of your application, and relies on your application to show the text file stdout if the user requests it. If your application doesn’t support text windows and files it can still call hAWK, but some input options and the showing of result files will be absent.

Getting started

To get going, add the source file “Call_Resource.c”, in the “code to call Drag_ons” folder on the same disk where you found this manual, to your application project. You will also need to add the standard ANSI library if it’s not already in your project (this won’t add much to the size of your built application). Compile it, and run it as well to check for linkage errors. If your application lacks some of the toolbox headers that are normally included in the MacHeaders precompiled standard header then you may have to explicitly #include them in the file “Call_Resource.c”.

Add two calls in your code

First, decide which of your application menus to use for showing the Drag_on Modules. Then follow the instructions at the top of “Call_Resource.c” in points 2 and 3 which describe how and where to place the two calls to functions in “Call_Resource.c”. InitCallResources() will load a list of Drag_on Modules into your chosen menu, and CallResource() will call a Drag_on Module when it is selected from your menu.

For an example of adding “Call_Resource.c” to an application and inserting the two required function calls, see the source code and THINK C project for “Minimal App” (the two calls are in “minimalApp.c”, and the copy of “Call_Resource.c” in the “Minimal App” folder is identical to the original in “code to call Drag_ons”).

A minimal version

Verify that line 98 or so of “Call_Resource.c” reads

    #define SUPPORT_LEVEL MINIMAL

Bring your THINK C project up to date, and build a new version of your application.
In order for hAWK and company to show up in your menu, the folder “Drag_on Modules” (with hAWK inside) needs to be in the same folder as your application, at the same level, so do this first before starting up your application.

Start your application, and you should see hAWK listed under the menu you have chosen to show Drag_on Modules. Select hAWK, and the setup dialog should appear; however, input options under the “Take input from:” popup will be limited to just the “Specific input file...” option. Select the program “$EchoFileNames”, and then use the “Take input from:” option to select any TEXT file for it to use as input. Click Run, wait about 2 seconds or until the mouse is back under your control, and then check however you like that the file “$tempStdOut” contains the name of the file you selected as input for “$EchoFileNames”.

Callbacks, and showing results

Once you have the above basic version up and running, you should read through the “Call_Resource.c” file and decide how much support to provide for the tasks of offering input options and showing the “$tempStdOut” result file. An important and easily–supported alert function (OKStopAlert) and a function for changing the cursor to a watch round out the list of functions that enhance hAWK’s performance (or that of any Drag_on Module, for that matter). The more functions you support, the more useful hAWK will be to your users.

If you decide to support any of these optional capabilities, also change the

    #define SUPPORT_LEVEL MINIMAL

statement in “Call_Resource.c” to reflect the level of support you are providing (instructions for this are in the file, around line 86).

Finally, at or a little after line 131 in “Call_Resource.c” you will see the statement

    static char callerName[] = "MyApp";

Change the name to the name of your application, and you’re done.

Any enhancements or modifications you make are your own business.
However, hAWK and most of the source code for hAWK are copyrighted by the Free Software Foundation; you can distribute hAWK and the source code for it, provided you follow the restrictions contained in the file “COPYING hAWK”, on the same disk where you found this manual. Where Dynabyte (Ken Earle) might be construed as owning the copyright, all rights are waived except the right to copyright, this latter only to preserve the former. Catch 23.

Using a command line

The last parameter to CallResource() is a pointer to an optional text command line. If this is not NULL, then the command line will be used to invoke the program specified by the command line, with no dialog shown. There are two things to do to make this work with your application:
• construct a proper command line for hAWK
• put something in your user interface to let your users call hAWK with the command line.

This is the format of a hAWK command line (note it can cover several lines):

    hAWK -f"Program Name" -f"Library Name" -s -ss -n
        -vVariableName="some value" -- MFS "InputFullPathname"

• the entire command line should be a C string (null terminated)
• the command line text must begin with "hAWK" followed by a space or tab
• there must be one program name, as signalled by -f. If you just supply a simple program name, it must reside in the "hAWK programs" folder. Use a full path name if the program is in some other folder. If the program name (or any part of the full path name) contains a space, then put quotes "" around the full name, otherwise the quotes are not needed.
• library names are given in the same way as the program name, and are optional. Since library names look the same as the program name, the first -f name seen is taken as the program name.
• variables are signalled by the -v option, eg -vmyName="Ken E" or -vlevel=1 where the quotes "" are optional if the value contains no spaces or tabs. Spaces before the '=' sign are optional, but don't put any between the '=' and the actual value.
Variables are optional. In particular, any variable settings that have been saved with the program (by using the setup dialog) will automatically be passed along with the command line, and so you should set these variables on the command line only if you want to override the default saved values (to see those, select the program in the setup dialog and click the "Set variables..." button).
• "--" signals that input files only follow. This is optional, mainly to make reading easier.
• "MFS" stands for "all files currently selected for multi-file operations", an input option that must be implemented by the calling application. This one is optional.
• input file names are optional, and should be provided as full path names. If any part of the full name contains a space then the quotes "" are necessary, otherwise they're optional.

You may also optionally use the following output options in the command line (place them before any "--"):
• -s means show stdout when done
• -ss means show and select stdout when done
• -n means no showing of stdout when done.
If you don't provide an output option, any output option from the settings saved with the program will be used instead (these correspond to the "Show/select stdout" checkboxes in the setup dialog). Any output option you do provide overrides the saved settings.

You may supply both "MFS" and one or more specific input files on a command line, and unlike the dialog approach you may supply any number of variables (the dialog is limited to 10).

As far as the interface goes, pressing <enter> or <command><return> to fire off a command line is reasonably standard (you may also require the entire command line to be selected, depending on how confusing things would be otherwise).
Some example command lines:

    hAWK -f$EchoFullPathNames -- MFS

    hAWK -f$BoilerPlate -vputInComment=1 -vfile="@.c" -vauthor="KE" -vcompany="bdibdi" -ss

-------------
Modifying hAWK
-------------

Introduction

Building hAWK used to be a nontrivial undertaking. Now, just build the "hAWK.µ" CodeWarrior project, merging it into an existing copy of "hAWK" when the merge dialog appears. At present, CodeWarrior ANSI libraries suffer from the problem that they allocate a 65K pointer and never let go of it, but this is worked around by throwing hAWK into its own heap zone when calling it, then dumping the whole heap when done.

Warning: the original PC code that hAWK is based on is old, very old, and the modifications to make it Macintosh were rather brutally done. If you plan major changes to hAWK, expect some grief along the way.

END hAWK MANUAL
(OOPS forgot to provide the Reverse Polish expression interpreter - what a tragedy...)

-------------------
Active index
-------------------

This index lists line numbers for topics, suitable for use with editors that allow you to jump to or “Go to” a selected line number.

| in reg. exp. 1680 || in patterns 1857 ~ (matching operator) 1600 ~! (not match operator) 1639 π 2100 \ in reg. exp. 1680 \1...\9 1706 \< 1692 \> 1693 \B 1695 \b 1694 \n 1697 \t 1696 \W 1691 \w 1690 ! in patterns 1857 $about the supplied programs 839 $tempStdIn 564 740 $tempStdErr 740 $tempStdOut 692 740 752 $tempStdOut is temporary 774 $ to start program name 525 $ in reg. exp. 1680 $EnumSwitch 401 1041 $FilesInOrderTest 2771 $MFS_SuperLister 2758 $PatternTester 1926 $sample programs see 839 && in patterns 1857 ( ) in reg. exp. 1680 * in reg. exp. 1680 + in reg. exp. 1680 . in reg. exp. 1680 >, >> (redirection) 2610 ? : in patterns 1873 ? in reg. exp. 1680 [ ] in reg. exp. 1680 ^ in reg. exp.
1680 actions 1947 All of front text 561 ANSI a4 3334 ARGC 1157 1325 ARGV[] 1147 1325 arrays 1439 atan2() 2082 automatic conversion 1405 auto version incrementing 2661 AWK and GAWK 268 backslash to break long lines 1098 beep() 2202 BEGIN (pattern) 1556 break 2351 breaking lines 1090 built–in string and file functions 2109 built–in variables 1325 built–in numeric functions 2082 Call_Resource.c 3242 calling hAWK from your application 3210 cancelling a run 735 close() 2682 command line 1147 comments in the source 1013 1115 comparison operators in patterns 1586 compound patterns 1856 concurrent and immediate modes 460 continue 2355 control–break 2989 control-flow statements 2311 concatenation 2010 constants 1247 conversion, numbers and strings 1405 copy() 2204 cos() 2082 delete 1472 do-while statement 2333 empty statements 2444 END (pattern) 1556 end–buffered input 3025 example hAWK programs 839 exists 2214 exit 2364 exp() 2082 expression operators 2024 expressions (as patterns) 1576 expressions in actions 1967 fields ($1 $2 etc) 1028 1280 fdate() 2216 FILENAME 1325 files, closing 2682 FNR 1325 for (var in array) 1471 2348 for (;;) statement 2338 Front text selection 560 FS (field separator) 1292 1325 2701 fsize() 2221 full path name, splitting 1543 2656 full path names 1222 1366 1545 2626 functions, user–defined 2451 function, local variables 1374 GAWK and AWK 260 getclip() 2222 getline 2731 grouping and breaking lines 1090 gsub() 2109 hAWK programs (folder) 525 hAWK, calling from your application 3210 hAWK, installing 191 hAWK() function 2792 if statement 2324 IGNORECASE 1325 immediate and concurrent modes 460 int() 2082 in (operator) 1459 index() 2109 input files, in order 2886 input on demand 3015 input selection for a program 536 installing hAWK 191 length() 2109 library files 666 lines, breaking and grouping 1090 list() 2234 local variables 1374 2484 log() 2082 lookup() 2141 Main program: (popup) 525 match() 2109 metacharacters 1680 MFS selected files 570 Minimal 
App 3258 minimalApp.c 3259 missing pattern 1537 modifying hAWK 3305 multiline records 2718 name conventions for programs 525 nested() 2239 next 2360 NF 1325 no input, specifying 590 NR 1325 null string 1267 number versus string 1405 numeric functions, built–in 2082 octal in reg. exp. 1713 OFMT 1325 OFS 1325 tolower() 2109 operators, table of 2033 ordering input files 2886 ORS 1325 output into files 2610 patterns and actions 1527 path names 1222 1366 1545 2626 patterns 1525 pattern, missing 1537 patterns, summary 1905 pipes (none) 287 presetting variables 598 print (preview of) 1990 print (details) 2511 printf statement 2536 printing this manual 232 program name conventions 525 program, input selection 536 prompt() 2174 punctuation, inside / / 1623 punctuation, inside quotes 1630 putclip() 2228 rand() 2082 range patterns 1881 records ($0) 1028 1280 redirecting output 2610 references 244 regular expressions 1644 regular expressions, examples 1752 remove() 2247 rename() 2251 return 2460 RLENGTH 1325 rolling buffer for input 3066 RS (record separator) 1325 1285 RSTART 1325 Run button 452 727 RUNERR 1325 sample hAWK programs 839 Save settings (button) 711 setup dialog 430 setup, saving 711 setting variables before a run 598 1225 1394 Selecting input for a program 536 Select all of stdout (checkbox) 706 Select input file… 582 Select unlisted program… 528 Show stdout (checkbox) 699 sin() 2082 sort() 2156 SortLibrary, sample library 685 specific order for input files 2886 split() 2109 split full path name 1543 2656 sprintf() (see also printf) 2549 2109 sqrt() 2082 srand() 2082 STDPATH 1325 2632 2964 standard input and output 740 statement grouping with {} 2321 stderr 2643 stdout 2643 string functions, built–in 2109 string-matching patterns 1600 string versus number 1405 sub() 2109 substr() 2109 SUBSEP 1325 1451 summary of patterns 1905 supplied hAWK programs 839 system 289 Take input from: (popup) 560 TIME builtin variable 1372 time() 2170 toupper() 2109 uninitialized 
variables 1267 unix a4 library 3334 user-defined functions 2451 variables 1247 variable, setting before a run 598 1225 1394 version incrementing 2661 (see also $TabsToSpaces) while statement 2327